bioinformatics

 

Will Terzaghi's Homepage




 
 

week 8 lecture

Modeling Structures

Predicting RNA secondary structures

RNA folds back on itself to form double-stranded regions called stems that are stabilized by H-bonding between complementary bases interspersed with single-stranded loops and bulges

Some of these single-stranded loops can bond to other loops elsewhere in the molecule to form pseudo-knots

Accurate prediction of RNA structure is becoming increasingly important as we discover more RNA molecules which serve structural roles and/or catalyze reactions


2 options:

  1. energy minimization: use models of how bases interact with water and each other to predict the structure with the lowest free energy
  • Conceptually it is similar to sequence alignment!
  • try to find the best path = path with lowest energy
  • look for hits, then perform dynamic programming until the best path (the one with the lowest energy) has been found
  • Mfold, a widely used program developed by Zuker, breaks the molecule into smaller regions that can interact.
    • It then calculates the energetics of the interactions between neighboring bases.
    • A set of "nearest neighbor" energy rules is used to calculate the energy of the entire structure.
    • A very good explanation of modeling RNA structures and a server that will fold RNA for you are posted at http://www.bioinfo.rpi.edu/~zukerm/

  2. knowledge-based

    1. identify related sequences of known structure (e.g. from X-ray crystallography or NMR)
    2. assume related regions will adopt similar structure
    3. use energy minimization to predict the structure of the portions that vary from the known sequence
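The dynamic-programming idea behind the energy minimization option can be sketched with a much simpler scoring scheme. The example below is not Mfold: instead of nearest-neighbor free energies it simply maximizes the number of nested base pairs (a Nussinov-style recursion), and it ignores pseudo-knots. The function name `max_base_pairs` and the `min_loop` parameter are assumptions made for this illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Simplified stand-in for energy minimization: every allowed pair
    scores 1, and a hairpin loop must contain at least `min_loop` bases.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = best score for the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j],      # base i left unpaired
                        best[i][j - 1])      # base j left unpaired
            if (seq[i], seq[j]) in allowed:  # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):        # two independent substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

As with sequence alignment, the table is filled from short subsequences to long ones, so each cell reuses the best answers already found for the pieces inside it.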

Predicting Protein structures

Goal is to take the primary structure (the amino acid sequence) and predict the three-dimensional structure

We're not there yet, but we have made significant progress.

Problem is that protein-folding is a complex process that isn't fully understood

Proteins fold in a stepwise manner due to interactions with water and with each other

  1. The backbone N-H and C=O groups of some amino acids H-bond to form secondary structures
  2. Certain secondary structures interact to form motifs
  3. motifs aggregate to form domains
  4. domains aggregate to form the tertiary structure of a polypeptide
  5. polypeptides aggregate to form the quaternary structure of a multi-subunit protein

Therefore, many programs attempt to simulate this process.

First step is predicting secondary structure: the backbone C=O of one a.a. H-bonds to the backbone N-H of another

4 options

  1. Alpha helix
    • bonds form between a.a.s within a chain
    • spiral due to H-bonds formed at regular intervals
    • side-chains face the outside of the helix
  2. Beta-pleated sheet
    • H-bonds form between chains
    • Forms sheets which may be flat, or somewhat twisted
    • side-chains are above and below the plane of the sheet
  3. Beta-turn
    • the backbone C=O of one a.a. H-bonds to the backbone N-H of the amino acid at the plus 3 position
    • this stabilizes abrupt changes in the direction of a chain
    • are found at the surfaces of proteins
  4. Random coil
    • a catch-all group for structures that aren't stabilized by H-bonds between the backbone C=O and N-H groups
    • sometimes subdivided into "omega loops" and everything else
    • "omega loops" reverse the direction of a chain and are also found at protein surfaces
      • do not form regular, periodic structures, but are often rigid and well-defined
      • because they are found at surfaces and have well-defined shapes, are also useful for modeling structure
      • often participate in interactions between the polypeptide and other molecules

Modeling Secondary Structures

Chou-Fasman is one of the most commonly used algorithms

  • measured frequencies at which each amino acid appeared in particular types of secondary structures in a set of proteins of known structure
  • assigns the amino acids three conformational parameters based on the frequency at which they were observed in alpha helices, beta sheets and beta turns
    1. P(a) = propensity to form alpha helices
    2. P(b) = propensity to form beta sheets
    3. P(turn) = propensity to form beta turns
  • also assigns 4 turn parameters based on frequency at which they were observed in the first, second, third or fourth position of a beta turn
    1. f(i) = probability of being in position 1
    2. f(i+1) = probability of being in position 2
    3. f(i+2) = probability of being in position 3
    4. f(i+3) = probability of being in position 4
  • The Chou-Fasman parameters for the 20 common amino acids:

    A.A.            P(a)   P(b)   P(turn)   f(i)    f(i+1)   f(i+2)   f(i+3)
    Alanine         142    83     66        0.060   0.076    0.035    0.058
    Arginine        98     93     95        0.070   0.106    0.099    0.085
    Asparagine      67     89     156       0.161   0.083    0.191    0.091
    Aspartic acid   101    54     146       0.147   0.110    0.179    0.081
    Cysteine        70     119    119       0.149   0.050    0.117    0.128
    Glutamic acid   151    37     74        0.056   0.060    0.077    0.064
    Glutamine       111    110    98        0.074   0.098    0.037    0.098
    Glycine         57     75     156       0.102   0.085    0.190    0.152
    Histidine       100    87     95        0.140   0.047    0.093    0.054
    Isoleucine      108    160    47        0.043   0.034    0.013    0.056
    Leucine         121    130    59        0.061   0.025    0.036    0.070
    Lysine          114    74     101       0.055   0.115    0.072    0.095
    Methionine      145    105    60        0.068   0.082    0.014    0.055
    Phenylalanine   113    138    60        0.059   0.041    0.065    0.065
    Proline         57     55     152       0.102   0.301    0.034    0.068
    Serine          77     75     143       0.120   0.139    0.125    0.106
    Threonine       83     119    96        0.086   0.108    0.065    0.079
    Tryptophan      108    137    96        0.077   0.013    0.064    0.167
    Tyrosine        69     147    114       0.082   0.065    0.114    0.125
    Valine          106    170    50        0.062   0.048    0.028    0.053

The algorithm identifies helix and sheet "nuclei", then applies a set of heuristic rules to determine whether these clusters of amino acids are sufficient to nucleate a region of alpha helix or beta sheet.

  • helix: 4 out of 6 amino acids with P(a) > 100
    • extend the nucleus in each direction until reaching four amino acids in a row with P(a) < 100
    • for each of these regions, add up all the P(a) and all the P(b) values
    • if the total P(a) is larger than the total P(b) and the run is more than 5 amino acids long, then it is predicted to be alpha helix
  • sheet: 4 out of 6 amino acids with P(b) > 100 (some people use 3 out of 5)
    • extend the nucleus in each direction until reaching four amino acids in a row with P(b) < 100
    • for each of these regions, add up all the P(a) and all the P(b) values
    • if the total P(b) is larger than the total P(a), the run is more than 5 amino acids long, and the average P(b) is > 100, then it is predicted to be beta sheet
  • If helices and sheets overlap, compare the total P(a) and total P(b) for the overlapping region. If the total P(a) is larger than the total P(b), the region is predicted to be alpha helix (and vice versa)
  • beta turn
    • calculate the likelihood of a turn, p(t), at position i as the product of f(i) for the amino acid at position i, f(i+1) for the following amino acid, f(i+2) for the next amino acid, and f(i+3) for the amino acid at the plus 3 position
    • Predict a beta turn at position i if the following criteria are met:
      • the calculated p(t) is > 0.000075
      • the average P(turn) for amino acids i to i+3 is > 100
      • the sum of the P(turn) values for amino acids i to i+3 is larger than both the sum of their P(a) values and the sum of their P(b) values
  • Accuracy = 50-85%, depending on the protein
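The helix rules above can be sketched in a few lines of Python. This is a minimal illustration, not a complete Chou-Fasman implementation: it covers only helix nucleation, extension and the P(a) versus P(b) comparison, and it does not resolve helix/sheet overlaps or predict turns. The P(a) and P(b) values are copied from the table in these notes; the function names and the one-letter keys are assumptions made for this example.

```python
# (P(a), P(b)) per one-letter amino acid code, copied from the table in these notes.
CF = {
    "A": (142, 83),  "R": (98, 93),   "N": (67, 89),   "D": (101, 54),
    "C": (70, 119),  "E": (151, 37),  "Q": (111, 110), "G": (57, 75),
    "H": (100, 87),  "I": (108, 160), "L": (121, 130), "K": (114, 74),
    "M": (145, 105), "F": (113, 138), "P": (57, 55),   "S": (77, 75),
    "T": (83, 119),  "W": (108, 137), "Y": (69, 147),  "V": (106, 170),
}
HELIX, SHEET = 0, 1  # positions of P(a) and P(b) in each tuple


def find_nuclei(seq, which=HELIX, window=6, needed=4):
    """Start indices of windows with at least `needed` residues whose propensity is > 100."""
    return [i for i in range(len(seq) - window + 1)
            if sum(CF[aa][which] > 100 for aa in seq[i:i + window]) >= needed]


def extend(seq, start, end, which=HELIX, run=4):
    """Grow the half-open region [start, end) until `run` weak residues in a row are met.

    Near the ends of the sequence fewer than `run` residues are checked.
    """
    def weak(chunk):
        return all(CF[aa][which] < 100 for aa in chunk)

    while end < len(seq) and not weak(seq[end:end + run]):
        end += 1
    while start > 0 and not weak(seq[max(0, start - run):start]):
        start -= 1
    return start, end


def predict_helix(seq, window=6):
    """Return a string with 'H' for predicted helix residues and '-' elsewhere."""
    mask = ["-"] * len(seq)
    for i in find_nuclei(seq, HELIX, window):
        start, end = extend(seq, i, i + window, HELIX)
        region = seq[start:end]
        total_a = sum(CF[aa][HELIX] for aa in region)
        total_b = sum(CF[aa][SHEET] for aa in region)
        if total_a > total_b and len(region) > 5:
            for k in range(start, end):
                mask[k] = "H"
    return "".join(mask)


print(predict_helix("GGGGEAMLEKGGGG"))
```

Because a nucleus only needs 4 strong residues out of 6, weak residues at the edge of a nucleus window are kept in the predicted region; this sketch makes no attempt to trim them.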

Another commonly used algorithm, GOR (Garnier, Osguthorpe and Robson) uses a window of 17 amino acids to predict secondary structure

  • rationale: experiments show each amino acid has a significant effect on the conformation of amino acids up to 8 positions in front or behind it.
  • a collection of 25 proteins of known structure was analyzed, and the frequency at which each amino acid was found in helix, sheet, turn or coil within the 17 position window was determined
    • this creates a 17 × 20 scoring matrix that is used to calculate the most likely conformation of each amino acid within the 17 a.a. window
  • This window slides down the primary sequence, scoring the most likely conformation for each amino acid based on the neighboring amino acids.
  • Accuracy is about 65%
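The sliding-window scoring described above can be sketched as follows. This is only the bookkeeping, not GOR itself: the real method derives its scores from the frequencies observed in proteins of known structure, whereas the `toy` scores here are made up purely so the example runs. The names `empty_tables`, `predict` and `toy` are assumptions for this illustration.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
STATES = ("H", "E", "T", "C")   # helix, sheet, turn, coil
HALF = 8                        # 8 residues on each side -> 17-position window


def empty_tables():
    """One 17 x 20 table of zeros per conformation: tables[state][offset][aa]."""
    return {s: [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(2 * HALF + 1)]
            for s in STATES}


def predict(seq, tables):
    """Assign each residue the state with the highest summed window score."""
    result = []
    for i in range(len(seq)):
        totals = {}
        for state in STATES:
            total = 0.0
            for offset in range(-HALF, HALF + 1):
                j = i + offset
                if 0 <= j < len(seq):   # the window is truncated at the ends
                    total += tables[state][offset + HALF][seq[j]]
            totals[state] = total
        result.append(max(STATES, key=totals.get))
    return "".join(result)


# Made-up toy scores, NOT the published GOR values:
toy = empty_tables()
for row in toy["H"]:
    row["A"] = 1.0                  # alanine anywhere in the window favors helix
for row in toy["E"]:
    row["V"] = 1.0                  # valine anywhere in the window favors sheet
for aa in AMINO_ACIDS:
    toy["C"][HALF][aa] = 0.5        # small baseline for coil at the central position

print(predict("AAAAAAGGGGVVVVVV", toy))
```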

Many other programs for modeling secondary structure and motifs are available

A one paragraph summary of most of these methods is posted at http://npsa-pbil.ibcp.fr/NPSA/npsa_references.html

Many of these methods are available at our old friends:

PredictProtein is another server that provides many useful tools for predicting secondary structures and other protein features http://bio.cigb.edu.cu/predictprotein/

Net Protein Sequence Analysis http://npsa-pbil.ibcp.fr/NPSA/ provides a number of programs for protein structure prediction

Some are variations on the Chou-Fasman approach that use different scoring matrices and weights

Others are knowledge-based: e.g. SIMPA96

  • predicts that short homologous sequences of amino acids will have the same secondary structure
  • takes 7 amino acid windows and searches through a database of sequences of known structure searching for homologous stretches

    Topits (at PredictProtein) tries to detect similarity in secondary structure and accessibility between a sequence of unknown structure and a known fold

    DSC uses multiple-sequence alignment

Some combine Chou-Fasman and knowledge-based approaches: e.g. DPM

  • makes two independent predictions: one based on a modified Chou-Fasman approach and one based on the protein's amino acid composition
  • uses the prediction based on amino acid composition to resolve conflicts in the C-F prediction

PHD uses a neural network approach

Neural networks are one way for computers to learn = Artificial Intelligence

Nets can deal with new patterns and generalize from training sets.

e.g. optical character recognition software, other more advanced image processing software

Nets are good at "perceptual" tasks and associative recall

These are tasks that the symbolic approach has difficulties with

mimics organization of the brain

  • Uses a network of processors (neurons) rather than a single CPU
  • Train them rather than program them!
  • Learn patterns by trial and error, rather than by rules
  • Incorrect guess weakens circuit pattern, correct guess strengthens it
  • The processing ability of the network is stored in the inter-unit connection strengths
  • Each unit (neuron) works as follows:
    1. Signals appear at the unit's inputs.
    2. Each signal is multiplied by the weight of its connection.
    3. The weighted signals are summed to give an overall activation.
    4. If the activation exceeds a threshold, an output is produced.
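The steps carried out by a single unit can be written out directly. This is a minimal sketch of one threshold unit; the function name `unit_output` and the example weights are assumptions chosen for illustration, not values from any real predictor.

```python
def unit_output(inputs, weights, threshold):
    """Output of a single threshold unit for one set of input signals."""
    # Steps 2 and 3: weight each input signal, then sum to get the activation.
    activation = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: produce an output only if the activation exceeds the threshold.
    return 1 if activation > threshold else 0


print(unit_output([1, 1], [0.6, 0.6], threshold=1.0))
print(unit_output([1, 0], [0.6, 0.6], threshold=1.0))
```

With these example weights the unit fires only when both inputs are on, because a single input contributes 0.6, which is below the threshold of 1.0.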

Neural networks are trained by adjusting weights
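A minimal sketch of training by weight adjustment, assuming the simplest possible case: a single unit trained with the perceptron rule to compute logical AND. Real predictors such as PHD use much larger networks and protein data; the function names, learning rate and training set here are assumptions for illustration only.

```python
def fire(inputs, weights, bias):
    """Threshold unit: output 1 if the weighted sum plus bias is positive."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) + bias > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Adjust weights by trial and error over (inputs, target) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - fire(inputs, weights, bias)  # 0 when the guess is correct
            # A wrong guess shifts the weights toward the right answer;
            # a correct guess leaves them unchanged.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


AND_SAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, fire(inputs, weights, bias), target)
```

Nothing about AND was programmed in: the behavior ends up stored entirely in the learned weights and bias.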

PHD trained its network using a database of soluble proteins

Best approach is to take the consensus from a number of different programs

This is done by many programs, e.g. CONSENSUS SECONDARY STRUCTURE PREDICTION at NPSA

http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_seccons.html

jPRED is another server that uses a consensus from neural network and homology approaches

http://www.compbio.dundee.ac.uk/~www-jpred/

PELE at DNA workbench runs the sequence through 8 different programs, gives you the output for each and a consensus
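Taking a consensus can be as simple as a per-residue majority vote. The sketch below assumes each program returns a string with one letter per residue (H = helix, E = sheet, C = coil) and marks ties with "?"; this is an illustration of the idea, not the method used by any of the servers named above.

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        top_state, top_votes = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_votes
        result.append("?" if tied else top_state)
    return "".join(result)


print(consensus(["HHHEC", "HHEEC", "HCEEC"]))
```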

To move from secondary to tertiary structure we need to predict which portions are likely to be at the surface and which will be buried in the middle, which are likely to be embedded in membranes, etc.

Many programs predict other features of proteins

Predicting hydrophobicity

  • Interactions with water drive protein folding
  • can model these interactions to predict transmembrane domains or hydrophobic pockets
  • Kyte and Doolittle developed a widely-used algorithm for predicting hydrophobicity
    • assigns each amino acid a hydropathy score based on free energy change for moving into water
      • negative scores are hydrophilic
      • positive are hydrophobic
    • next calculates a moving average along the protein, much like the GOR algorithm, in which the free energy change for moving that portion of the protein into water is calculated
      • user specifies length of the window
      • recommended spans are 7-11 residues
    • Program then plots average hydrophobicity of the moving window at each residue (i.e., at position 10 it is plotting the average hydropathy of residues 5-15)
      • Useful for predicting transmembrane domains or hydrophobic pockets
      • this can be done in MacVector
      • GREASE in DNA workbench will also do this for you
      • so will Protein Hydrophilicity/Hydrophobicity Search in SearchLauncher
  • Several other scoring matrices have been developed, e.g. Hopp-Woods and GES, but the general idea is similar
  • PHDacc uses a neural network approach to predict the solvent accessibility of amino acids: i.e., whether they will be at the surface or buried
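The moving-average calculation behind a Kyte-Doolittle plot can be sketched as follows. The hydropathy values are the published Kyte-Doolittle scale, which is not reproduced elsewhere in these notes; the function name `hydropathy_profile`, the default window of 9 and the example sequence are assumptions for illustration.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic, negative = hydrophilic).
KD = {
    "A": 1.8,  "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8,  "K": -3.9, "M": 1.9,  "F": 2.8,  "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """List of (1-based position, average hydropathy of the window centered there)."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    profile = []
    for center in range(half, len(seq) - half):
        segment = seq[center - half:center + half + 1]
        average = sum(KD[aa] for aa in segment) / window
        profile.append((center + 1, average))
    return profile


# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
for position, score in hydropathy_profile("KKDEKRLLIVALLVIALGDEKRKK"):
    print(position, round(score, 2))
```

Positions closer to either end than half a window are skipped, so the profile is shorter than the sequence; plotting the scores against position gives the familiar hydropathy plot, with sustained positive peaks suggesting buried or membrane-spanning regions.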

Various programs model specific types of structure

  • Coiled coils http://www.ch.embnet.org/software/coils/COILS_doc.html
    COILS compares a sequence to a database of known parallel two-stranded coiled-coils, derives a similarity score, then calculates the probability that the sequence will adopt a coiled-coil conformation.
  • many programs predict transmembrane regions and orientation
  • other programs predict antigenicity of various portions of the protein
    • some search for peaks likely to be exposed at the surface
    • others search for flexibility, another structural component thought to contribute to antigenicity
      • each amino acid is assigned a rigidity score, then a sliding window of three amino acids is used to calculate the rigidity of each amino acid in the sequence
  • Some programs predict amphiphilicity: structures that will be polar on one side and non-polar on the other
    • amphiphilic domains are often found at membrane protein interfaces, e.g. in channel proteins, or at protein solvent interfaces.
  • Finally, many programs have been written to identify specific motifs and domains, such as nuclear localization signals, chloroplast transit peptides, helix-turn-helix motifs, etc.

 




Last update: Thursday, March 13, 2003 at 4:31:26 PM.