bioinformatics

 
Home

syllabus

general information

homework

lectures

websites

Will Terzaghi's Homepage



Membership

Login

 
 

week 2 lecture

Using databases

Specific queries

  • by name (or text)
  • by sequence alignment
  • by structure alignment

Browsing

  • by organism
  • by key word
  • by chromosome

Data mining

  • The exploration & analysis of large quantities of data in order to discover meaningful (and previously unknown) patterns
  • A major division of computer science!

Calvin:

Data mining

An introduction to the field is provided at http://www.the-data-mine.com/

A journal devoted to the field can be read at http://www.digimine.com/usama/datamine/

Lots of information and links to many aspects of data mining are posted at http://www.kdnuggets.com/software/index.html

The "tools for data mining" at NCBI describes the services available at NCBI http://www.ncbi.nlm.nih.gov/Tools/index.html

http://industry.ebi.ac.uk/~brazma/dm.html provides information about biological datamining

Another good source of information is: http://bioinformatics.weizmann.ac.il/cards/knowledge.html

Many companies make their living mining biological databases

http://www.gene.com/gene/research/biotechnology/genomics.jsp

Other examples are Millenium Pharmaceuticals http://www.millennium.com/ and Human Genome Sciences http://www.hgsi.com/

Anthony Petrolonis is a recent Wilkes Biology graduate who works for Millenium Pharmaceuticals http://www.millennium.com/ and Michael Donahue is a recent Wilkes Biology graduate who works for Human Genome Sciences http://www.hgsi.com/

One major purpose of datamining is knowledge discovery

Two types of knowledge discovery

  1. Directed (similar to hypothesis testing)

  • Purpose: explain a target field in terms of the others
  • pick a target field based on a hypothesis and ask the algorithm to classify or predict it
    • example: look for all genes that contain a homeobox
    • example: align twenty bacterial genomes and look for the trp operon and nearest neighbors

  1. Undirected

  • purpose: finding interesting patterns
    • example: align twenty bacterial genomes and look for sequences that tend to occur in the same place
    • example: find all genes that are turned on in particular cancers, or that respond in the same way to various treatments.

    Undirected KD helps identify relationships and directed KD helps explain them

Common tasks for data mining

  1. description
  2. classification: placing in categories based on characteristics
    • e.g. homeobox proteins, or G-protein-linked receptors
  3. Affinity grouping : grouping by relations (not characteristics)
    • e.g. genes activated by heat shock
  4. Clustering: segmenting a diverse population into similar groups according to some similarity measure
    • no pre-defined classes and no examples!
    • e.g. genes that occur in the same location in bacterial genomes
  5. Estimation: rules that explain how to estimate a value given characteristics
  6. Prediction: rules that explain how to predict a future value given characteristics

Applications for data mining in bioinformatics

  1. Identifying genes from different organisms that are related to each other
  2. Identifying genes within the same organism that are related to each other
    • helps to infer the function of unknown genes identified by sequencing genomes or cDNAs if find portions that are related to genes of known function
  3. Identifying genes that are expressed together in microarray experiments
  4. Identifying genes that are physically located next to each other in bacteria
    • Hypothesis: they are part of same biochemical pathway
  5. Identifying other structures that are functionally related

Many companies make their living mining biological databases: the challenge is writing software that can identify meaningful patterns

  • We don't fully understand molecular biology
  • Molecular biology has many quirks that make programming difficult
    • Redundancy
      • DNA is double-stranded
      • Many organisms are diploid (or more)
      • Many genes are present in multiple copies
    • The genetic code is degenerate
    • RNA processing, e.g. Splicing and editing
    • Some mutations have much more effect than others
    • Some amino acid substitutions have much more effect than others on protein structure
    • Genes are 1-D but function depends on the 3-D structure of their products
    • many proteins are modified after synthesis in ways that are presently difficult to predict
    • Many genes have multiple functions
    • many genes (or proteins) interact in ways that are presently difficult to predict

A common task for data mining in bioinformatics is searching data bases for specific types of sequences

  • related genes
  • sequences which occur in specific locations
  • interesting patterns
  • assembling sequences of entire genomes

General approach is to compare a query sequence with all the sequences in the database: somehow the sequences must be aligned.

References for sequence alignment

http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

http://www.umanitoba.ca/faculties/afs/plant_science/courses/39_769/lec04/lec04.1.html

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

http://www.ncbi.nlm.nih.gov:80/BLAST/tutorial/Altschul-1.html

http://www.inf.ethz.ch/personal/cannaroz/courses/compbio/week2/week2/week2.html

Underlying assumptions:

  1. all life evolved from a common ancestor: therefore all genes and proteins have diverged from a common origin.
    • For example, many English and French words have evolved from a common ancestor, and we can recognize that they are related:
      • e.g. Ferret vs furet
    • The same is true for many English and German words
      • e.g. Enough vs genug
  2. We therefore assume that we can calculate the probability 2 sequences had a common ancestor and estimate the time since they had a common ancestor
  3. We also assume that evolution occurred stepwise via a series of mutations
  4. We assume that evolution obeys the rules of DNA replication and repair
    • bases change by transitions ( pyr -> pyr or pur -> pur) , transversions (pyr ->pur or pur ->pyr) or by insertions and deletions
    • we can estimate the probability of each kind of mutation"

    tautomers:

    depur:

    insert:

    transposition:

  5. selection acts on phenotype, therefore, some mutations will have no effect, others will be lethal!

These assumptions have been used to construct mathematical models of evolution based on probability. These models are then used to estimate

  1. the most likely way these sequences could have evolved from a common ancestor
  2. the probability this happened
  3. evolutionary distance is estimated from number of steps and probability of each one occurring

Many models are called "hidden Markov models" ref: http://www.msci.memphis.edu/~giri/compbio/papers/hmmshamir.pdf

They are based on Markov chains in which the probability of moving from one state to the next at each step depends only on state it is in

  • For example, the probability of mutating from an A to a G depends only on what the base is now, not on whether 10 generations ago it was a T.

The probability of moving from one state to the next at each step is called a transition probability

If know (or estimate) probability of each possible event can arrange in a transition matrix based on probability a particular mutation can occur and will be allowed

can estimate probability of reaching a particular state in n number of steps

compare observed events with probability matrix

Definitions

  • Similar -  The degree to which two species or populations share identities.
  • Homologous - characters that are similar due to common ancestry
  • Analogous - characters that are similar due to convergent evolution
  • Orthologous - characters that are homologous with conserved function
  • Paralogous - characters that are homologous with divergent function.

    similarity is a measurement!

    Homology is a judgement!

    Computers measure similarity!

    We attempt to infer homology!

Sequence Alignment

Computer scientists call this a string alignment problem, similar to routing phone calls

Think of this as a path: task is finding best one

A common approach for aligning 2 sequences is to start with a dot matrix in which you place a dot at each location that have an alignment

enough:

dot:

 

find best path that links the segments

path:

Searching Genbank is aligning a very long sequence!

Done using dynamic programming: approach developed by Needleman &Wunsch

Needleman-Wunsch assumes that can find optimal path by incrementing optimal subpaths

  • need a scoring mechanism!
  • Basic problem: correlating scoring with biology
  • General approach, deduct points for mismatches
    • Deduct more for gaps, since deletions are usually more serious = gap penalties
    • have large penalty for the first missing base, add smaller penalty for each additional base
    • How to deduct for mismatches?
      • For nucleic acids deduct more for transversions than transitions

      transition:

  • But: protein alignments are far more sensitive!
    • 20 possible amino acids cf 4 possible nucleotides
    • Changes in nucleotides often do not affect protein
    • Proteins have 3x fewer residues
    • Proteins are what selection acts on
    • Therefore, proteins are preferred for searches
  • Attempt to score amino acid mismatches based on consequence of substitution
    • Weighting in favor of "biologically acceptable"paths
    • Margaret Dayhoff waas the first to try this: she prepared a model of evolution in which some mutations are more acceptable than others
    • These were identified by comparing sequences of proteins from closely related organisms
    • She developed a unit called the Point Accepted Mutation (PAM) One PAM = 1% of amino acids have changed over evolution
    • The 1 PAM mutation matrix mutates amino acids with probability of 1%.
    • The 100 PAM mutation matrix mutates amino acids with probability of 100 mutations/100 amino acids
      • not every aa mutates and some mutate back to starting point!
    • To calculate score, need to account for differences in normal abundance of various amino acids
    • Dayhoff matrices are 10 x log (probability (a.a are paired due to common ancestry)/ (probability (pairing would occur by chance)) = log-odds table

      PAM:

       

    Using PAM

    • compare mismatches with PAM matrix to decide how important they are
    • For any possible amino acid substitution, a value from the scoring matrix is added to the similarity score
    • Therefore: is important to pick correct scoring matrix!
      • for closely related organisms rules of mutation are most important
      • for distantly-related organisms rules of amino acid substitution are most important
      • Therefore must pick matrix appropriate for expected sequence similarity
      • Use small PAM numbers (e.g PAM100) to compare closely related spp, large numbers (e.g. PAM250) for more distant
      • One criticism of the PAM matrices is that the PAM units were calibrated using pairs of closely-related proteins
    • Many other matrices have been developed based on other criteria
      • Blosum (BLOcks SUbstitutionMatrix) matrices were calculated using data from the BLOCKS database [http://www.blocks.fhcrc.org], which contains alignments of more distantly-related proteins
      • should be more realistic for comparing distantly-related proteins since are derived from distantly-related proteins. Derived empirically cf PAM extrapolated based on theory
      • BLOSUM 62 = minimum 62% identity
      • BLOSUM 30 = minimum 30 % identity-> suitable for comparing distant-related species
      • BLOSUM matrices are generally superior to PAM matrices for detecting biological relationships.
      • MDM10 and MDM20 matrices are better suited for closely related species
    • You will see all these different substitution matrices as options at various web sites that allow you to perform homology searches online
        N.B.: basic approach is similar, argument is how to weight gaps and mismatches!

    Algorithms compute score for "paths" by summing the values assigned by scoring matrix & gap penalties

    NWi,j = max{ NWi-1,j-1 + s(ai,bj); NWi-k,j + gj; NWi,j-k + gi }

    • Match = 1
    • Mismatch = -1
    • Gap = -2

    score matrix:

    Path with highest score wins

    They then compare score with value predicted by pure luck= E value

    • expected number of hits scoring this high by dumb luck after screening this many candidates
    • Theoretical reason for using smallest possible database: Each hit is more statistically significant!

    Usually compare with sequences generated randomly using a DNA or protein sequence model

    Alternative = comparing with composition of the entire database

    Needleman-Wunsch calculates global alignments: align entire sequences

    This often doesn't work well for aligning DNA or protein sequences, because of the following problems:

    • Biological: proteins are often composed of motifs & domains that are shuffled between different proteins
      • global alignments would miss these, instead scores similarity of entire protein
    • Statistical: theory for computing significance of global alignments is weak

    Local alignments are therefore weapon of choice

    • recognise conserved elements such as motifs and domains in proteins that are otherwise very different
      • e.g. homeodomains in homeobox proteins
    • strong statistical theory
    • 2 types of algorithm: gapped and ungapped
      • Ungapped has stronger theory, simpler algorithms
      • Gapped is more biologically relevant
        • identifies related sequences despite insertion/deletion

    Strongest but slowest local alignment algorithm is Smith-Waterman

    • Computes local alignments using Needleman-Wunsch algorithm
    • Key difference: stops alignment when score goes to 0
      • SWi,j = max{ SWi-1,j-1 + s(ai,bj); SWi-k,j + gj; SWi,j-k + gi; 0 }
      • N-M continues alignment adding negative cells
        • Match = 5, Mismatch = -4; Open Gap = 0, Extend Gap = -7

        SW:

    • Guaranteed to find best alignment!
    • Unfortunately, it is very slow
      • Therefore several faster heuristic (rule of thumb) methods have been developed to speed up initial search

    FASTA

    heuristic approximation of Smith-Waterman developed by William R. Pearson at the University of Virginia

    1) Breaks query and test sequences into overlapping words (2 a.a. or 6 nt) and looks for matches

    • rationale: homologous sequences should have runs of common nucleotides or amino acids

    2) next it scores the hits using substitution matrix

    3) it then tries to join high-scoring segments using gaps < a certain size (window size)

    4) Next it uses Smith-Waterman to find best alignment within the vicinity of high-scoring segments

    5) Finally, it computes the score for optimal alignment

    • uses linear regression against ln search set sequence length to calculate z-score
    • uses distribution of the z-score to estimate number of sequences expected to produce higher z-scores by chance

    Several different types of FASTA programs are available

    Fasta3 protein vs protein database or DNA vs DNA database

    Fastx3./fasty3 DNA translated in 3 frames vs protein database. Fastx3 allows frameshifts between codons, fasty also allows frameshifts within codons (slower)

    Tfastx3/tfasty3 protein vs DNA database allowing frameshifts

    FASTA:

    Many servers allow you to search genomic or proteiomic databases online. A few will also let you use Smith-Waterman (since it is so much more computationally intensive)

    http://fasta.bioch.virginia.edu/

    http://www.ebi.ac.uk/fasta33/

    http://www.ddbj.nig.ac.jp

    http://workbench.sdsc.edu/

    http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl

    http://searchlauncher.bcm.tmc.edu/seq-search/protein-search.html

    http://decypher.stanford.edu/index_by_algo.htm




  • Last update: Tuesday, January 21, 2003 at 11:53:34 AM.