bioinformatics

 
Will Terzaghi's Homepage




week 7 homework

Biology 398INA: Topics in Bioinformatics
Homework # 7
Using gene-finding programs
Due March 1, 2003

Please send me your answers by email. You can either create a new file or download the MS Word file and type in your answers.

"week7homework.doc"

Part I: Finding specific types of prokaryotic sequences

1. Why is it relatively easy to find genes in bacterial genomes?
2. Go to http://www.tigr.org/software/transterm.html
    • What sorts of terminators does TransTerm find?
    • How many did it find in the Aquifex aeolicus genome?
    • About how many did it find in the Escherichia coli genome?
3. Go to Entrez (http://www.ncbi.nlm.nih.gov/entrez/) and search “nucleotide” for AE007956 (a portion of the genome of the bacterium Agrobacterium tumefaciens).
    • Copy the complete DNA sequence and paste it into either a MacVector file (MacVector should automatically remove all the numbers and spaces) or a word processor file, then use the replace command to remove all numbers and spaces.
4. Now go to http://www.fruitfly.org/seq_tools/promoter.html
    • Paste the edited sequence into the window, then select “Type of organism: prokaryotic,” “Include reverse strand? yes” and submit.
    • How many promoters does it find on the top strand?
    • Where are the three best promoters?
    • How many promoters does it find on the bottom strand?
    • Where are the three best promoters?
5. Go to http://www.softberry.com/berry.phtml?topic=bprom
    • How accurate is BPROM?
    • How far is it from most bacterial promoters to the protein coding sequences?
6. Go to http://www.softberry.com/berry.phtml?topic=gfindb , give your query a name, then paste the edited sequence into the window.
    • Select “BPROM” and “Escherichia coli K-12” as closest organism, then “perform search.”
    • How many promoters does it find on the top strand?
    • Where are the three best promoters?
    • How many promoters does it find on the bottom strand?
    • Where are the three best promoters?
    • How do these compare with the results from the previous site?
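
The manual clean-up in step 3 (removing the position numbers and spaces from the copied sequence) can also be scripted. The following is a minimal sketch in Python, assuming the sequence was copied as plain text in the usual GenBank layout. The function name clean_sequence and the sample lines are made up for illustration and are not taken from AE007956.

```python
import re

def clean_sequence(raw: str) -> str:
    """Remove digits and all whitespace from a pasted GenBank sequence block."""
    # GenBank flat files print the sequence as numbered lines of 10-base groups,
    # so digits, spaces, and line breaks are layout rather than sequence.
    return re.sub(r"[\d\s]+", "", raw).lower()

# Hypothetical two-line excerpt in GenBank layout (invented for illustration).
sample = """
        1 gatcctccat atacaacggt atctccacct
       31 caggtttaga tctcaacaac ggaaccattg
"""

cleaned = clean_sequence(sample)
print(len(cleaned))  # number of bases left after clean-up
print(cleaned)
```

The result is a single unbroken string of bases, which is the form the promoter-finding sites above expect in their input windows.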

      Part II: Finding prokaryotic genes
      1. Hit the “back” button. This should return you to http://www.softberry.com/berry.phtml?topic=gfindb
      2. Select “fgenesB” and “Escherichia coli K-12” as closest organism, then “perform search.”
      • How many genes does it find?
      • How many are on the + strand?
      • How many are on the - strand?
      • Does it find any genes not listed in the GenBank flat file?
      • Do the starts and stops agree with those listed in the GenBank flat file? If not, which ones are different?
      3. Now go to http://opal.biology.gatech.edu/GeneMark/
      • How does GeneMark find genes?
      • Now click on the hypertext under “For prokaryotic genomic DNA analysis, you can use the parallel combination of the GeneMark and GeneMark.hmm programs”
        • Paste the edited sequence from step I.3 into the window, select “species: Escherichia coli K-12,” deselect “generate postscript graphics,” then click Start GeneMark.HMM
      • How many genes does it find?
      • How many are on the + strand?
      • How many are on the - strand?
      • Does it find any genes not listed in the GenBank flat file?
      • Do the starts and stops agree with those listed in the GenBank flat file? If not, which ones are different?
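
The comparisons asked for in Part II (strand counts, genes missing from the annotation, and disagreeing starts or stops) can be kept track of with a small script once the coordinates have been copied out by hand. This is only a sketch under stated assumptions: the Gene record, the helper names, and every coordinate below are hypothetical, and nothing here parses fgenesB, GeneMark, or GenBank output.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    start: int   # 1-based coordinate of the first base
    end: int     # 1-based coordinate of the last base
    strand: str  # "+" or "-"

def strand_counts(genes):
    """Count how many genes lie on each strand."""
    plus = sum(1 for g in genes if g.strand == "+")
    return {"+": plus, "-": len(genes) - plus}

def compare(predicted, annotated):
    """Sort predictions into exact matches, boundary disagreements, and novel calls."""
    exact, differ, novel = [], [], []
    for p in predicted:
        if p in annotated:
            exact.append(p)
        elif any(a.strand == p.strand and a.start <= p.end and p.start <= a.end
                 for a in annotated):
            differ.append(p)  # overlaps an annotated gene, but start or stop differs
        else:
            novel.append(p)   # no annotated gene on that strand overlaps it
    return exact, differ, novel

# Hypothetical coordinates, invented for illustration only.
annotated = [Gene(100, 900, "+"), Gene(1200, 2000, "-")]
predicted = [Gene(100, 900, "+"), Gene(1150, 2000, "-"), Gene(2500, 3100, "+")]

print(strand_counts(predicted))
exact, differ, novel = compare(predicted, annotated)
print("agree:", exact)
print("different start/stop:", differ)
print("not in annotation:", novel)
```

Treating any same-strand overlap as "the same gene with different boundaries" is a simplifying choice; closely packed or overlapping bacterial genes may need to be checked by eye.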

      Part III: Finding specific types of eukaryotic sequences
      1. Why would you want to find a eukaryotic promoter?
      2. What general approaches are used to find eukaryotic promoters?
      3. What is a TATAA box, and why is it useful for finding promoters?
      4. What other sorts of sequences are useful for finding eukaryotic promoters?
      5. Go to Entrez (http://www.ncbi.nlm.nih.gov/Entrez) and select nucleotide
      • Type in "AC068324"
      • Copy bases 60,001 to 70,020 and save them as a MacVector file (MacVector should automatically remove all the numbers and spaces). Alternatively, save them as an MS Word file and use the replace command to remove all numbers and spaces.
      6. Go to http://bimas.dcrt.nih.gov/cgi-bin/molbio/signal
      • What does Signal Scan do?
      • Paste your edited sequence into the input window, select “plant” then click “submit.”
      • How many binding sites does it find for TBP (TATA-binding protein)?
      7. Go to http://bimas.dcrt.nih.gov/molbio/proscan/index.html and analyze your sequence
      • How does proscan find promoters?
      • How many promoters does it find?
      • Where are the TATAA boxes?
      8. Now try again at http://www.fruitfly.org/seq_tools/promoter.html
      • How does NNPP find promoters?
      • Paste the edited sequence into the window, then select “Type of organism: eukaryotic,” “Include reverse strand? yes” and submit.
      • How many promoters does it find?
      • Where are the TATAA boxes?
      • How does this compare with ProScan from III.7?
      9. Now let's look for splice sites at http://www.fruitfly.org/cgi-bin/seq_tools/splice.html
        • Why look for splice sites?
        • What is a splice donor site?
        • What is a splice acceptor site?
        • How many splice donor sites does it find?
        • How many splice acceptor sites?
      10. Now let's look for splice sites at http://www.cbs.dtu.dk/services/NetPGene/
      • How does NetPGene find splice sites?
      • You will need to convert your file to FASTA format: add a first line that starts with a >, then a line break, and begin your DNA sequence on the next line. For example:
      >Arabidopsis query
      aattggtcgtagctaggccataagc......
      • How many splice donor sites does it find?
      • How many splice acceptor sites?
      11. Now let's look for splice sites at http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
        • How does this program find splice sites?
        • How many splice donor sites does it find?
        • How many splice acceptor sites?
      12. Now let's look for polyA sites at http://argon.cshl.org/tabaska/polyadq_inst.html
        • Why look for polyA sites?
        • How does this program find polyA sites?
        • Note that this site also needs FASTA format
        • How many polyA sites does it find?
      13. Now let's look for repeated sequences at http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker
      • Why look for repeated sequences?
      • How does this program find them?
      • Note that this site also needs FASTA format
      • Be sure to specify that this is an Arabidopsis sequence
      • How many repeats does it find?
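
Two steps in Part III are mechanical and can be scripted: cutting out bases 60,001 to 70,020 (step 5, a stretch of 10,020 bases) and wrapping the result in FASTA format (step 10). The sketch below assumes the full sequence is already available as one clean string of bases. The function names are made up for illustration, and the stand-in sequence is not AC068324.

```python
def subsequence(seq: str, first: int, last: int) -> str:
    """Return bases first..last using 1-based, inclusive coordinates (GenBank style)."""
    return seq[first - 1:last]

def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as FASTA: a '>' header line, then wrapped sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

# Stand-in sequence for illustration; the real input would be the cleaned entry.
full = "acgt" * 50
piece = subsequence(full, 11, 150)
print(to_fasta("Arabidopsis query", piece))
```

Python slices are 0-based and exclude the end index, which is why the function subtracts 1 from the first coordinate but not from the last.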

      Part IV: Finding eukaryotic genes

      1. What is the general approach used to find eukaryotic genes?
      2. Go to http://www.softberry.com/berry.phtml?topic=gfind
      • Paste in your sequence and analyze it using FGENEP (“Multiple genes structure prediction in plant genomic DNA”); specify that it is a plant sequence.
      • How many genes did it find?
      • How many exons?
      • Where do the genes start and finish?
      • How long are the coding sequences?
      • How do they compare with the GenBank entry (remember to subtract 60,000)?
      • Now analyze the sequence with FEX
      • How many exons does it find?
      • Why do you think that this number is different?
      • Now analyze the sequence with SPL
      • How many splice donor sites does it find?
      • How many splice acceptor sites?
      • How do these numbers compare with the numbers found in steps III.9, III.10 and III.11?
      • Why do you think that they vary?
      • Now analyze the sequence with PolyAH.
      • How many sites does it find?
      • How do these numbers compare with the numbers found in step III.12?
      • Now analyze the sequence with TSSW.
      • How many promoters does it find?
      • Where are the TATAA boxes?
      • How do these numbers compare with the numbers found in steps III.7 and III.8?
      3. Now let's analyze the sequence with GENSCAN http://genes.mit.edu/GENSCAN.html
      • Be sure to specify that it is an Arabidopsis sequence
      • How many genes does it find?
      • How many exons are in each?
      • How long are the proteins?
      • How do they compare with FGENEP?
      • How do they compare with the GenBank entry (remember to subtract 60,000)?
      4. Now let's analyze the sequence with Webgene http://www.itba.mi.cnr.it/webgene
      • Select GeneBuilder
      • Be sure to specify Arabidopsis as your organism, and to use both GeneBuilder and GenView.
      • How many genes does it find?
      • How do they compare with the previous programs?
      • How many exons are in each gene it finds?
      • How do they compare with the previous programs?
      • How many TATAA boxes?
      • How do they compare with the previous programs?
      • How many polyA sites?
      • How do they compare with the previous programs?
      • How long are the proteins?
      • How do they compare with the previous programs?
      • How do they compare with the GenBank entry (remember to subtract 60,000)?
      5. How would you choose between the different answers you have been given by these various programs?
      6. Go to the week 7 websites page and pick two gene-finding programs that we haven’t used yet (or find two of your own, e.g. at http://linkage.rockefeller.edu/wli/gene/programs.html). Then tell me which programs you picked, how they find genes, what genes they found in our test sequence, and how those compare with the GenBank entry.
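
The repeated reminder to "subtract 60,000" follows from how the test sequence was made: it begins at base 60,001 of AC068324, so position 1 in a program's output corresponds to position 60,001 in the GenBank entry. A minimal sketch of that conversion, together with the coding-sequence length obtained by summing exon lengths, is shown here. The exon coordinates are hypothetical and are not taken from the real entry.

```python
OFFSET = 60_000  # the excerpt begins at base 60,001 of the GenBank entry

def genbank_to_local(pos: int) -> int:
    """Convert a GenBank coordinate to a coordinate in the 10,020-base excerpt."""
    return pos - OFFSET

def local_to_genbank(pos: int) -> int:
    """Convert a coordinate reported by a gene finder back to GenBank numbering."""
    return pos + OFFSET

def cds_length(exons) -> int:
    """Total coding length in bases; exon coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in exons)

# Hypothetical exon coordinates as they might appear in a GenBank feature table.
genbank_exons = [(61250, 61480), (61600, 62010)]
local_exons = [(genbank_to_local(s), genbank_to_local(e)) for s, e in genbank_exons]

print(local_exons)              # [(1250, 1480), (1600, 2010)]
print(cds_length(local_exons))  # 642
```

Converting everything to one coordinate system before comparing programs avoids mistaking a numbering offset for a real disagreement about where a gene starts or stops.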




Last update: Friday, February 28, 2003 at 9:14:01 AM.