week 8 lecture
Predicting RNA secondary structures
RNA folds back on itself to form double-stranded
regions called stems, which are stabilized by H-bonding between complementary
bases and are interspersed with single-stranded loops and bulges
Some of these single-stranded loops can bond
to other loops elsewhere in the molecule to form pseudo-knots
Accurate prediction of RNA structure is becoming
increasingly important as we discover more RNA molecules which serve structural
roles and/or catalyze reactions
2 options:
- energy minimization: use models
of how bases interact with water and each other to predict the structure with
the lowest free energy
- Conceptually it is similar to sequence alignment!
- try to find the best path =
path with lowest energy
- look for hits, then perform dynamic programming
until the best path (the one with the lowest energy) has been found
- Mfold, a widely used program developed by
Zuker, breaks the molecule into smaller regions that can interact.
- It then calculates the energetics of
the interactions between neighboring bases.
- A set of "nearest neighbor"
energy rules is used to calculate the energy of the entire structure.
- A very good explanation of modeling RNA
structures and a server that will fold RNA for you are posted at http://www.bioinfo.rpi.edu/~zukerm/
- knowledge-based
- identify related sequences of known structure
(e.g. from X-ray crystallography or NMR)
- assume related regions will adopt similar
structure
- use energy minimization to predict the structure
of the portions that vary from the known sequence
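The dynamic-programming idea behind the energy-minimization option can be illustrated with a much simpler relative of Mfold. The sketch below is NOT Mfold's nearest-neighbor energy model: it is the Nussinov approach, which simply maximizes the number of base pairs. The function name, the inclusion of G-U wobble pairs and the minimum hairpin loop of 3 unpaired bases are assumptions made for this illustration.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    Simplification: every allowed pair scores 1, whereas Mfold sums
    nearest-neighbor free energies. min_loop is the assumed minimum
    number of unpaired bases enclosed by a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = most pairs that can form within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]            # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov_max_pairs("GGGAAAUCC"))
```

As in sequence alignment, the table is filled from short subsequences to long ones, so each entry reuses results that have already been computed.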
Predicting Protein structures
Goal is to take the primary structure (the
amino acid sequence) and predict the three-dimensional structure
We're not there yet, but we have made significant
progress.
Problem is that protein-folding is a complex process
that isn't fully understood
Proteins fold in a stepwise manner due to interactions
with water and with each other
- The backbone NH and C=O groups of nearby amino acids
H-bond to form secondary structures
- Certain secondary structures interact to form motifs
- motifs aggregate to form domains
- domains aggregate to form the tertiary structure
of a polypeptide
- polypeptides aggregate to form the quaternary structure
of a multi-subunit protein
Therefore, many programs attempt to simulate this
process.
The first step is predicting secondary structure, in which the backbone C=O
of one a.a. H-bonds to the backbone NH of another
4 options:
- Alpha helix
- bonds form between a.a.s within a chain
- spiral due to H-bonds formed at regular intervals
- side-chains face the outside of the helix
- Beta-pleated sheet
- H-bonds form between chains
- Forms sheets which may be flat, or somewhat
twisted
- side-chains are above and below the plane of
the sheet
- Beta-turn
- the backbone C=O of one a.a. H-bonds to the backbone
NH of the amino acid at the plus 3 position
- this stabilizes abrupt changes in the direction
of a chain
- are found at the surfaces of proteins
- Random coil
- a catch-all group for structures that
aren't stabilized by H-bonds between the backbone C=O and NH groups
- sometimes subdivided into "omega
loops" and everything else
- "omega loops" reverse the direction
of a chain and are also found at protein surfaces
- do not form regular, periodic structures,
but are often rigid and well-defined
- because they are found at surfaces and
have well-defined shapes, they are also useful for modeling structure
- often participate in interactions between
the polypeptide and other molecules
Modeling Secondary Structures
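Chou-Fasman is one of the most commonly used algorithms
Each amino acid is assigned a propensity for alpha-helix, P(a), for beta-sheet,
P(b), and for turns, P(turn), scaled so that 100 is average
It identifies helix and sheet "nuclei", then applies a set
of heuristic rules to determine whether these clusters of amino acids are sufficient
to nucleate a region of alpha-helix or beta-sheet.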
- helix: 4 out of 6 amino acids with P(a) > 100
- extend the nucleus in each direction
until reaching four amino acids in a row with P(a) < 100
- for each of these regions, add up all
the P(a) and all the P(b) values.
- If the total P(a) is larger than the total
P(b) and the run is more than 5 amino acids long, then it is predicted
to be alpha helix
- sheet: 4 out of 6 amino acids with P(b) > 100
(some people use 3 out of 5).
- extend the nucleus in each direction
until reaching four amino acids in a row with P(b) < 100
- for each of these regions, add up all
the P(a) and all the P(b) values.
- If the total P(b) is larger than the total
P(a), the run is more than 5 amino acids long, and the average P(b)
is > 100, then it is predicted to be beta sheet.
- If helices and sheets overlap then compare
the total P(a) and total P(b) for the overlapping region. If the total P(a)
is larger than the total of P(b) then it is predicted to be alpha helix (and
vice-versa)
- beta turn
- calculate the likelihood of a turn, P(t), starting at
amino acid i as the product f(i) x f(i+1) x f(i+2) x f(i+3), where each f is
the frequency with which that amino acid is found at the first, second, third
or fourth position of a turn.
- Predict a beta-turn at position i if the
following criteria are met:
- the calculated P(t) is > 0.000075
- the average P(turn) for amino acids
i to i+3 is > 100
- the sum of the P(turn) values for amino
acids i to i+3 is larger than both the sum of the P(a) values and the sum
of the P(b) values
- Accuracy = 50-85%, depending on the protein
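The helix rules above (nucleate, extend, then compare totals) can be sketched as follows. This is a minimal illustration, not a full Chou-Fasman implementation: the function names are hypothetical, the propensity tables cover only a handful of residues (values as commonly tabulated for Chou-Fasman, to be checked against a published table before real use), and the behaviour at the ends of the sequence is an assumption.

```python
# Propensities (x100) for a few residues only; an incomplete,
# illustrative subset of the Chou-Fasman tables.
P_ALPHA = {"E": 151, "M": 145, "A": 142, "L": 121, "V": 106, "G": 57, "P": 57}
P_BETA = {"E": 37, "M": 105, "A": 83, "L": 130, "V": 170, "G": 75, "P": 55}


def find_helix_nuclei(seq, p_alpha, window=6, min_hits=4, cutoff=100):
    """Start indices of windows in which at least min_hits residues have P(a) > cutoff."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[i:i + window] if p_alpha[aa] > cutoff)
        if hits >= min_hits:
            starts.append(i)
    return starts


def extend_region(seq, p, start, end, run=4, cutoff=100):
    """Grow the half-open region [start, end) in both directions.

    Growth stops on a side when the next `run` residues on that side all
    have a propensity below cutoff. Assumption: if fewer than `run`
    residues remain, the region grows to the end of the sequence.
    """
    while end < len(seq):
        nxt = seq[end:end + run]
        if len(nxt) == run and all(p[aa] < cutoff for aa in nxt):
            break
        end += 1
    while start > 0:
        prev = seq[max(0, start - run):start]
        if len(prev) == run and all(p[aa] < cutoff for aa in prev):
            break
        start -= 1
    return start, end


def is_helix(seq, start, end, p_alpha, p_beta):
    """Apply the final test: longer than 5 residues and total P(a) > total P(b)."""
    region = seq[start:end]
    total_a = sum(p_alpha[aa] for aa in region)
    total_b = sum(p_beta[aa] for aa in region)
    return len(region) > 5 and total_a > total_b


seq = "GGGGAEELMAGGGG"
print(find_helix_nuclei(seq, P_ALPHA))
start, end = extend_region(seq, P_ALPHA, 4, 10)
print(start, end, is_helix(seq, start, end, P_ALPHA, P_BETA))
```

The sheet rules follow the same pattern with P(b) in place of P(a) plus the extra check on the average P(b).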
Another commonly used algorithm, GOR (Garnier,
Osguthorpe and Robson) uses a window of 17 amino acids to predict secondary
structure
- rationale: experiments show each amino acid
has a significant effect on the conformation of amino acids up to 8 positions
in front or behind it.
- a collection of 25 proteins of known structure
was analyzed, and the frequency at which each amino acid was found in helix,
sheet, turn or coil within the 17 position window was determined
- this creates a 17 x 20 scoring matrix (17 window positions x 20 amino acids)
for each conformation, which is used to calculate
the most likely conformation of each amino acid within the 17 a.a. window
- This window slides down the primary sequence,
scoring the most likely conformation for each amino acid based on the neighboring
amino acids.
- Accuracy is about 65%
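The sliding-window scoring can be sketched as below. This only shows the mechanics: the function name and the layout of the scoring table are assumptions, and the scores in TOY_SCORES are invented placeholders, not GOR parameters. A real table would hold one 17 x 20 matrix per conformation.

```python
def gor_like_predict(seq, scores, half_window=8):
    """Predict one conformation per residue with a sliding window.

    scores[state][(offset, amino_acid)] is the contribution of an amino
    acid found `offset` positions from the residue being predicted.
    Missing entries count as 0 (an assumption of this sketch).
    """
    states = list(scores)
    prediction = []
    for i in range(len(seq)):
        totals = {}
        for state in states:
            total = 0.0
            # half_window=8 gives the 17-residue window: 8 on each side plus the centre
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(seq):
                    total += scores[state].get((offset, seq[j]), 0.0)
            totals[state] = total
        # choose the conformation with the highest summed score
        prediction.append(max(states, key=totals.get))
    return "".join(prediction)


# Invented placeholder scores, for illustration only.
TOY_SCORES = {
    "H": {(0, "A"): 1.0, (-1, "A"): 0.5, (1, "A"): 0.5},
    "E": {(0, "V"): 1.0, (-1, "V"): 0.5, (1, "V"): 0.5},
    "C": {},
}

print(gor_like_predict("AAAVVV", TOY_SCORES))
```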
Many other programs for modeling secondary structure
and motifs are available
A one paragraph summary of most of these methods is posted at http://npsa-pbil.ibcp.fr/NPSA/npsa_references.html
many of these methods are available at our
old friends
PredictProtein is another server that provides
many useful tools for predicting secondary structures and other protein features
http://bio.cigb.edu.cu/predictprotein/
Net Protein Sequence Analysis
http://npsa-pbil.ibcp.fr/NPSA/ provides a number of
programs for protein structure prediction
Some are variations on the Chou-Fasman approach
that use different scoring matrices and weights
Others are knowledge-based: e.g. SIMPA96
Some combine Chou-Fasman and knowledge-based
approaches: e.g. DPM
- makes two independent predictions: one based on
a modified Chou-Fasman approach and one based on the protein's amino acid composition
- uses the prediction based on amino acid composition
to resolve conflicts in the C-F prediction
PHD uses a neural network approach
Neural networks are one way for computers to
learn (a form of artificial intelligence)
Nets can deal with new patterns and generalize
from training sets.
e.g. optical character recognition software,
other more advanced image processing software
Nets are good at "perceptual" tasks and
associative recall
These are tasks that the symbolic approach
has difficulties with
Neural networks mimic the organization of the brain
- Uses a network of processors (neurons) rather
than a single CPU
- Train them rather than program them!
- Learn patterns by trial and error, rather
than by rules
- Incorrect guess weakens circuit pattern,
correct guess strengthens it
- The processing ability of the network is
stored in the inter-unit connection strengths
- Signals appear at the unit's inputs.
- Each signal is multiplied by the weight (strength) of its connection
- Weighted signals are summed to give an overall
activation
- If activation exceeds a threshold an output
is produced
Neural networks are trained by adjusting weights
PHD trained its networks using a database
of soluble proteins
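The behaviour of a single unit (weighted inputs, summed activation, threshold) and training by adjusting weights can be sketched with a one-unit perceptron. This is a toy illustration only; it is not PHD's architecture or training procedure, and the names, learning rate and the logical-AND training set are assumptions.

```python
def unit_output(weights, bias, inputs):
    """Sum the weighted inputs; fire (1) only if the activation exceeds the threshold of 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_unit(examples, epochs=20, rate=1.0):
    """Perceptron rule: leave weights alone after a correct guess, shift them after a wrong one."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - unit_output(weights, bias, inputs)  # 0 when the guess is correct
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy training set: output 1 only when both inputs are 1 (logical AND).
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_unit(AND_EXAMPLES)
print([unit_output(weights, bias, x) for x, _ in AND_EXAMPLES])
```

Everything the trained unit "knows" is held in its weights and bias, which is the sense in which processing ability is stored in the connection strengths.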
Best approach is to take the consensus from a
number of different programs
this is done by many programs, e.g. CONSENSUS
SECONDARY STRUCTURE PREDICTION at NPSA
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_seccons.html
jPRED is another server that uses a consensus
from neural network and homology approaches
http://www.compbio.dundee.ac.uk/~www-jpred/
PELE at DNA workbench runs the sequence through
8 different programs, gives you the output for each and a consensus
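One simple way to form a consensus is a majority vote at each position. The sketch below assumes that approach; the servers named above may combine their predictions differently, and the function name is hypothetical.

```python
from collections import Counter


def consensus_prediction(predictions):
    """Majority vote per position over equal-length prediction strings.

    Each string holds one state per residue (e.g. H, E or C). Ties go to
    the state seen first at that position (an arbitrary choice).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*predictions)
    )


print(consensus_prediction(["HHHEEC", "HHCEEC", "HHHECC"]))
```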
To move from secondary to tertiary structure
we need to predict which portions are likely to be at the surface and which
will be buried in the middle, which are likely to be embedded in membranes,
etc.
Many programs predict other features of proteins
Predicting hydrophobicity
- Interactions with water drive protein folding
- can model these interactions to predict
transmembrane domains or hydrophobic pockets
- Kyte and Doolittle developed a widely-used
algorithm for predicting hydrophobicity
- assigns each amino acid a hydropathy
score based on free energy change for moving into water
- negative scores are hydrophilic
- positive are hydrophobic
- it then calculates a moving average along the protein using a sliding window, much like
the GOR algorithm, to estimate the free energy change for moving that portion
of the protein into water
- user specifies length of the window
- recommended spans are 7-11 residues
- the program then plots the average hydropathy
of the moving window at each residue (e.g. with an 11-residue window, the value
plotted at position 10 is the average hydropathy of residues 5-15)
- Useful for predicting transmembrane domains or hydrophobic
pockets
- this can be done in MacVector
- GREASE in DNA workbench will also
do this for you
- so will Protein Hydrophilicity/Hydrophobicity
Search in SearchLauncher
- Several other scoring matrices have been
developed, e.g. Hopp-Woods and GES, but the general idea is similar
- PHDacc uses a neural network approach to
predict the solvent accessibility of amino acids: i.e., whether they will
be at the surface or buried
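The Kyte-Doolittle moving average described above can be sketched as follows. The hydropathy values are the published Kyte-Doolittle scale; the function name, the default window of 9 and the choice to skip positions too close to the ends are assumptions of this sketch.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Return (position, average hydropathy) pairs, positions numbered from 1.

    The window must be odd so that it is centred on a residue. Residues
    nearer to an end than half a window get no value.
    """
    if window % 2 == 0 or window > len(seq):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    profile = []
    for center in range(half, len(seq) - half):
        segment = seq[center - half:center + half + 1]
        average = sum(KD[aa] for aa in segment) / window
        profile.append((center + 1, average))
    return profile


for position, value in hydropathy_profile("LLLLLKKKKK", window=5):
    print(position, round(value, 2))
```

Sustained stretches of high positive values in the plotted profile are the candidates for transmembrane domains or hydrophobic pockets.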
Various programs model specific types of structure
- Coiled coils: COILS (http://www.ch.embnet.org/software/coils/COILS_doc.html)
compares a sequence to a database of known parallel
two-stranded coiled-coils, derives a
similarity score, then calculates the probability that the sequence will adopt
a coiled-coil conformation.
- many programs predict transmembrane regions
and orientation
- other programs predict antigenicity of various portions of the protein
- some search for peaks likely to be exposed
at the surface
- others search for flexibility, another structural
component thought to contribute to antigenicity
- each amino acid is assigned a rigidity
score, then a sliding window of three amino acids is used to calculate
the rigidity of each amino acid in the sequence
- Some programs predict amphiphilicity: structures
that will be polar on one side and non-polar on the other
- amphiphilic domains are often found at
membrane-protein interfaces, e.g. in channel proteins, or at protein-solvent
interfaces.
- Finally, many programs have been written to
identify specific motifs and domains, such as nuclear localization signals,
chloroplast transit peptides, helix-turn-helix motifs, etc.
