Protein Structure Prediction: On the cusp between Futility ...

Protein Structure Prediction: On the Cusp between Futility and Necessity?
Thomas Huber
Supercomputer Facility, Australian National University, Canberra
email: [email protected]

The ANU Supercomputer Facility
- Mission: support computational science through provision of HPC infrastructure and expertise
- ANU is host of APAC
  - >1 Tflop (300-500 processors by 2002)
  - first machines now up and running
- Fujitsu collaboration at ANU
  - System software development
  - Computational chemistry project
    - 5-6 persons
    - porting and tuning of basic chemistry code to Fujitsu supercomputer platforms
    - current code of interest: Gaussian98, Gamess-US, ADF, Mopac2000, MNDO94, Amber, GROMOS96

My work
- Fujitsu collaboration
  - Responsible for MD software porting and tuning to Fujitsu supercomputer platforms
- Collaboration with The Institute for Physical and Chemical Research (Riken), Japan
  - Riken designed purpose-specific hardware for MD simulation
  - MD-machine: >1 Tflop sustained performance (20 Gflop per chip)
  - Gordon Bell Prize finalist (best performance for money)
  - We wrote biomolecular simulation software
- Research
  - Protein structure prediction

Today's talk
- Something old
  - Protein structure prediction
  - Basics of protein fold recognition
  - How to build a low resolution force field
- Something new
  - How to improve fold recognition
  - Performance assessment
- Something for the future
  - Where is fold recognition useful
  - Perverting the concept of fold recognition
- Something new (for future work)
  - Model calculations

Protein Structure Prediction: Two Approaches

- Direct (ab initio) prediction
  - Thermodynamics: structures with low energy are more likely
- Prediction by induction
  - Fold recognition
  - More moderate goal: recognise if a sequence matches a protein structure

Why is fold recognition attractive?
- The search problem is notoriously difficult
- Searching in a library of known folds: finding the optimum solution is guaranteed

Is this useful?
- 10^4 protein structures determined
- <10^3 protein folds

Fold Recognition = Computer Matchmaking
- Structure Disco

Why is Fold Recognition better than Sequence Comparison?
- Comparison is done in structure space, not in sequence space
- Sausage: 2 step strategy

Three basic choices in molecular modelling
- Representation: which degrees of freedom are treated explicitly
- Scoring: which scoring function (force field)
- Searching: which method to search or sample conformational space

Sequence-Structure Matching
- The search problem
- Gapped alignment = combinatorial nightmare

Model Representation
1. Conventional MM (structure refinement)

4. Low resolution (structure prediction)

Scoring
- Quality of prediction is given by $E = \sum_{ij} E_{ij}$
- Functional form of interactions
  - simple
  - continuous in function and derivative
  - discriminates two states
  - hyperbolic tangent function: $E_{ij} = k_{ij}\,[1 - \tanh(d_{ij} - d_0)]$

Parametrisation of Discrimination Function
- Gaussian distribution: $N(E) \propto \exp\left(-\frac{(E - \langle E \rangle)^2}{2\sigma^2}\right)$
- z-score $= \frac{E - \langle E \rangle}{\sigma}$
- Minimisation of z-score with respect to parameters

Size of Data Set
- 893 non-homologous proteins
  - Representative subset of PDB
  - < 25% sequence identity
  - 30-1070 amino acids
- >10^7 mis-folded structures
- 2 force fields
  - Neighbour unspecific (alignment): 336 parameters
  - Neighbour specific (ranking alignments): 996 parameters
- Parameters well determined!

Is Our Scoring Function Totally Artificial?
- No! Force field displays physics

Trimer Stability

- Nitrogen regulation proteins
- 2 proteins: PII (GlnB) and GlnK
  - 112 residues
  - sequence: 67% identities, 82% positives
  - structure: 0.7 Å RMSD
  - trimeric
- Dr S. Vasudevan: hetero-trimers

Hetero-trimer Stability
- What is the most/least stable trimer?
- Why use a low resolution force field?
  - Structures differ (0.7 Å RMSD)
  - Side chains are hard to optimise
[Figure labels: GlnK, GlnB]
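
A low-resolution pair term of the kind described under Scoring, together with the z-score used to parametrise it, can be sketched in a few lines. This is a minimal illustration and not the Sausage implementation: it assumes the pair term has the form k_ij [1 - tanh(d_ij - d_0)] (the minus signs are an assumption), and the function names, the example numbers and the use of the population standard deviation are all hypothetical choices.

```python
import math
import statistics


def pair_energy(d, k, d0):
    """Smooth two-state pair term, assumed form k * (1 - tanh(d - d0)).

    Tends to 2k for distances well below d0 and to 0 well above it.
    """
    return k * (1.0 - math.tanh(d - d0))


def total_energy(distances, strengths, d0):
    """Sum the pair terms over all residue pairs (E = sum of E_ij)."""
    return sum(pair_energy(d, k, d0) for d, k in zip(distances, strengths))


def z_score(native_energy, misfolded_energies):
    """Offset of the native energy from the mean misfolded energy,
    in units of the standard deviation of the misfolded energies."""
    mean = statistics.mean(misfolded_energies)
    sigma = statistics.pstdev(misfolded_energies)
    return (native_energy - mean) / sigma


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(pair_energy(d=6.0, k=-1.0, d0=6.0))  # pair sitting exactly at d0
    print(z_score(-3.0, [-1.0, 1.0]))          # native vs. two decoys
```

With this convention a more negative z-score means the native structure is better separated from the mis-folded ones, which is what minimising the z-score with respect to the parameters aims for.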

Calculation: GlnB3 > GlnB2-GlnK > GlnB-GlnK2 > GlnK3
Experiment: GlnB3 > GlnB2-GlnK > GlnB-GlnK2 > GlnK3

Does it work with Fold Recognition?
- Blind test of methods (and people)
  - methods always work better when one knows the answer
- 30 proteins to predict
- 90 groups (40 fold recognition)
- Torda group (our methodology) one of them
- All results published in Proteins, Suppl. 3 (1999)

Fold Recognition Official Results (Alexey Murzin)

Fold Recognition Predictions Re-evaluated (computationally by Arne Elofsson)
- Investigation of 5 computational (objective) evaluations
- Comparison with Murzin's ranking

Improvements to Fold Recognition
- Noise vs signal
- Average profiles

- Geometry optimised structures

Structure Optimisation
- X-ray structure
  - high (atomic) resolution
  - fits exactly 1 sequence
- Structure for fold recognition
  - low resolution (fold level)
  - should fit many sequences
- Optimise structure (coordinates) for fold recognition

How are Structures Optimised?
- Goal: NOT to minimise the energy of the structure BUT to increase the energy gap between correctly and incorrectly aligned sequences
- Deed:
  - 20 homologous sequences (<95%)
  - 20 best scoring alignments from (893) wrong sequences
  - change coordinates to maximise energy gap between right and wrong
  - restrain to X-ray structure (change <1 Å RMSD)
  - 100 steps energy minimisation
  - 500 steps molecular dynamics
- Hope: important structural features are (energetically) emphasised

Effect of Structure Optimisation
- Lysozyme (153l_)
- [Figure: Old Profile vs New Profile]

More Information about Structure
- Predicted secondary structure
  - highly sophisticated methods
  - secondary structure terms not well reproduced by force field
  - easy to combine with force field term
- Correlated mutations in sequence
  - $d_{ij} \sim c_{ij} = \frac{\langle (s_i - \langle s_i \rangle)(s_j - \langle s_j \rangle) \rangle}{\sigma_i \sigma_j}$
  - can reflect distance information
  - yet untested (by us)

Where are we now?
Cassandra package

- fast O(N) alignment
- structurally optimised library
- side chain modelling
- fully automatic predictions

Extensive testing with big test sets
- Mock prediction for 595 test sequences
- Homologous structure with < 25% sequence identity in library
- 25%: homologous structure ranks #1
- 45%: correct hit in top 10
- average shift error of alignment: 4
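
Figures such as "ranks #1" and "correct hit in top 10" can be computed from the rank at which the first correct template appears for each test sequence. The sketch below only shows that arithmetic; the function name and the example ranks are hypothetical and are not taken from the Cassandra test set.

```python
def top_k_rate(first_correct_ranks, k):
    """Fraction of queries whose first correct template is ranked <= k.

    A rank of None means no correct template was found for that query.
    """
    hits = sum(1 for r in first_correct_ranks if r is not None and r <= k)
    return hits / len(first_correct_ranks)


if __name__ == "__main__":
    # Hypothetical ranks for eight queries, for illustration only.
    ranks = [1, 3, 12, 1, None, 7, 10, 25]
    print(top_k_rate(ranks, 1))   # share of queries with a correct top hit
    print(top_k_rate(ranks, 10))  # share with a correct hit in the top 10
```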

Confidence of prediction
Predicting new folds

Structure Prediction Olympics 2000
- CASP4 experiment held April - September 2000
- 43 target sequences
  - 30 with no sequence homology detectable with sequence-sequence alignment techniques
- 154 prediction groups
- Cassandra predictions
  - top 5 predictions for all targets are submitted
  - no human intervention (why?)
- Leap frog or being frogged?

- Results to be published in December

CASP4: T111
- Protein name: enolase
- Organism: E. coli
- # amino acids: 436
- Homologous sequence of known structure: YES! Structure solved by molecular replacement.
- PSI-Blast search
  - 4enl: Enolase
  - 431 residues aligned
  - 46% identities, 62% positives
  - Expect = 10^-100

Homologous structures to 4enl in fold library
- FSSP structure-structure comparison
- 33 homologous structures
  - < 13% sequence identity, > 3.6 Å RMSD, < 50% of full structure

Name    Z    RMSD  nali  nstr  seqid  file
1a49A   9.8  4.7   204   519   11     Opt_bin/1a49A.bin
1byb    9.8  3.7   196   490    6     Opt_bin/1byb_.bin
1nar    8.3  3.7   184   289    8     Opt_bin/1nar_.bin
1b5tA   8.1  3.6   180   275    6     Opt_bin/1b5tA.bin
1aj2    8.1  3.8   175   282    8     Opt_bin/1aj2_.bin
1cnv    7.8  3.6   177   283    9     Opt_bin/1cnv_.bin
1qba    7.6  3.9   190   858    7     Opt_bin/1qba_.bin
1dhpA   7.4  4.0   166   292    8     Opt_bin/1dhpA.bin
1onrA   7.3  3.3   169   316    8     Opt_bin/1onrA.bin
4xis    7.3  4.3   187   386   13     Opt_bin/4xis_.bin
1rpxA   7.2  3.5   156   230    7     Opt_bin/1rpxA.bin
1pud    7.2  4.2   191   372    5     Opt_bin/1pud_.bin
1smd    7.1  3.8   180   496    6     Opt_bin/1smd_.bin
1eceA   6.9  3.9   182   358   11     Opt_bin/1eceA.bin
1oyc    6.5  4.3   183   399    9     Opt_bin/1oyc_.bin
1edg    6.0  3.9   178   380    7     Opt_bin/1edg_.bin
1dosA   5.9  3.8   161   343   11     Opt_bin/1dosA.bin
2dorA   5.6  4.0   163   311    8     Opt_bin/2dorA.bin
1bd0A   5.5  3.4   143   381    8     Opt_bin/1bd0A.bin
1a4mA   5.4  3.7   153   349    9     Opt_bin/1a4mA.bin
2tpsA   5.2  3.6   142   226    5     Opt_bin/2tpsA.bin
1uroA   5.1  4.0   151   357    8     Opt_bin/1uroA.bin
1aq0A   5.0  3.8   152   306    6     Opt_bin/1aq0A.bin
1tml    4.9  4.0   146   286   10     Opt_bin/1tml_.bin
1uok    4.9  4.1   175   558    8     Opt_bin/1uok_.bin
2plc    3.6  4.6   149   274    7     Opt_bin/2plc_.bin
1nfp    2.8  4.4   128   228    7     Opt_bin/1nfp_.bin
1wab    2.6  3.8   108   212    7     Opt_bin/1wab_.bin
1auz    2.4  3.3    83   116    1     Opt_bin/1auz_.bin
1mtyG   2.4  3.9    61   162    4     Opt_bin/1mtyG.bin
8abp    2.2  4.2    85   305   11     Opt_bin/8abp_.bin
1fuiA   2.2  4.9   108   591    6     Opt_bin/1fuiA.bin
1be1    2.1  3.6    92   137    8     Opt_bin/1be1_.bin

T111: Cassandra prediction
Sorted by score:

score    nali  name
7533.9   324   "1a4mA"  adenosine deaminase
7269.9   309   "1onrA"  transaldolase
7112.5   298   "1rkd_"  ribokinase
7016.9   359   "1ch6A"  glutamate dehydrogenase
7009.3   329   "1dosA"  aldolase class ii
6959.4   333   "3pte_"  d-alanyl-d-alanine carboxypeptidase
6866.3   323   "1uroA"  uroporphyrinogen decarboxylase
6810.6   303   "1cipA"  guanine nucleotide-binding protein
6788.4   352   "1smd_"  amylase
6785.8   277   "1a4iB"  methylenetetrahydrofolate dehydrogenase
6783.6   284   "1dhpA"  dihydrodipicolinate synthase
6771.2   364   "1ajsA"  aspartate aminotransferase
.
.

Probability of this result by chance: p = 1.36 × 10^-9
BUT: Alignment is shifted!!!
PSI-Blast prediction is much better.

Summary
Urgency of Prediction
- sequencing: fast & cheap
- structure determination: hard & expensive
- 10^4 structures are determined
  - insignificant compared to all proteins
Fold recognition

- a feasible way to predict protein structure
- is not perfect (9/10, 1/4)
- requires special scoring functions
Low resolution scoring functions
- knowledge based, from database of known protein structures
- only meaningful when database is big
- data mining?
- not necessarily physical BUT capture important physical features

Future work
- Large scale structure prediction
  - Fold recognition on genomic scale
  - 20% predicted protein >> what's in PDB
  - putative proteins
  - new folds
  - from structure to function (maybe too hard)
  - why our CASP submissions are fully automatic
- Experimentally assisted structure prediction
  - cross linking & MS
- Prediction based structure determination
  - structure determination is much easier if a tentative model is already known
  - use experiment to confirm prediction

What else? The inverse problem
- Is there a sequence match for a structure?

Applications for the inverse problem
- Fishing for putative sequences in genomic ponds
- Better sequences for proteins

What is better?
- More stable
- More soluble
- Better to crystallise
- Better function
- etc.

Rational Protein Design
- GlnB: is there a better sequence for the GlnB structure?

Example GlnB
[Figure: GlnB, metallochaperone, ribosomal protein, papillomavirus DNA binding domain, acylphosphatase; labels 11%, 8%, 10%, 11%]
- Nature uses same fold motif for different functions

Why important?
- Minimalistic proteins
- Many industrial applications
  - E.g. enzymes in washing powder
    - should be stable at high temperatures
    - work faster at low temperature

Naïve Concoction
- Use energy score
  - e.g. score from low resolution force field
- Change sequence to lower energy

Why naïve?
- Comparing energies of different sequences is like comparing apples with potatoes
- Free energy is the all-important measure
- Is it possible to capture free energy in a simple function?

Model Calculations on a Simple Lattice
- Explore model protein universe
  - Square lattice
  - Simple hydrophobic/polar energy function (HH=1, HP=PP=0)
  - Chains up to 16-mers
  - evaluation of all conformations (exact free energy) for all possible sequences

Our small universe
- 802074 self avoiding conformations
- 2^16 = 65536 sequences
- 1539 (2.3%) sequences fold to unique structure
- 456 folds
- 26 sequences adopt most common fold
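
The exhaustive lattice calculation described above can be sketched for short chains. This is a minimal illustration under stated assumptions, not the code used for the talk: each HH contact between lattice neighbours that are not chain neighbours is taken to lower the energy by one unit (the sign is an assumption), conformations related by rotation or reflection are counted once, the free energy is read as -kT ln Z over all conformations, and every function name is hypothetical.

```python
import math
from itertools import product

MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Yield self-avoiding walks of n >= 2 beads on the square lattice.

    Rotations and reflections are removed by fixing the first step to +x
    and forcing the first turn (if any) to point towards +y.
    """
    def grow(path, occupied, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and dy == -1:
                continue  # mirror image of a walk whose first turn is +y
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue
            path.append(nxt)
            occupied.add(nxt)
            yield from grow(path, occupied, turned or dy != 0)
            occupied.remove(nxt)
            path.pop()

    start = [(0, 0), (1, 0)]
    yield from grow(start, set(start), False)


def contacts(conf):
    """Pairs (i, j) of lattice neighbours that are not chain neighbours."""
    index = {pos: i for i, pos in enumerate(conf)}
    pairs = []
    for i, (x, y) in enumerate(conf):
        for dx, dy in MOVES:
            j = index.get((x + dx, y + dy))
            if j is not None and j > i + 1:
                pairs.append((i, j))
    return pairs


def energy(seq, pairs):
    """Assumed sign convention: each HH contact contributes -1."""
    return -sum(1 for i, j in pairs if seq[i] == "H" and seq[j] == "H")


def ground_state(seq, contact_lists):
    """Return (lowest energy, number of conformations that reach it)."""
    energies = [energy(seq, pairs) for pairs in contact_lists]
    lowest = min(energies)
    return lowest, energies.count(lowest)


def free_energy(seq, contact_lists, kT=1.0):
    """Free energy -kT ln Z summed over all enumerated conformations."""
    z = sum(math.exp(-energy(seq, pairs) / kT) for pairs in contact_lists)
    return -kT * math.log(z)


def unique_folders(n):
    """All HP sequences of length n with a non-degenerate ground state."""
    contact_lists = [contacts(c) for c in conformations(n)]
    return ["".join(s) for s in product("HP", repeat=n)
            if ground_state(s, contact_lists)[1] == 1]


if __name__ == "__main__":
    n = 8  # 16-mers work the same way but take far longer in pure Python
    print(len(list(conformations(n))), "conformations")
    print(len(unique_folders(n)), "sequences with a unique ground state")
```

Precomputing the contact list of every conformation once and reusing it for all sequences is what keeps the double loop over conformations and sequences manageable.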

Free energy approximation
- Question: is there a simple function which approximates free energy?
- Calculate free energies for all sequences
- Select folding sequences and use them to fit new scoring function
- Correlate free energy and approximated free energy for all sequences
- Using simple 3 parameter HP matrix for fit does not work well

BUT ... Extended Functional Form (5 parameters)
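
The fit-then-correlate procedure listed above can be sketched as follows. This is only an illustration of the workflow: ordinary least squares is an assumption (the slides do not say how the fit was done), the three features are taken to be the HH, HP and PP contact counts of a structure, all function names are hypothetical, and the toy numbers are invented purely to check the arithmetic.

```python
import math


def fit_least_squares(features, targets):
    """Ordinary least squares via the normal equations (A^T A) w = A^T b."""
    k = len(features[0])
    # Augmented matrix [A^T A | A^T b].
    m = [[sum(row[i] * row[j] for row in features) for j in range(k)]
         + [sum(row[i] * t for row, t in zip(features, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def predict(features, weights):
    """Approximate score: weighted sum of the contact counts."""
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def pearson(xs, ys):
    """Pearson correlation between exact and approximated values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Invented (n_HH, n_HP, n_PP) contact counts and exactly linear targets,
    # used only to check that the fit recovers the generating weights.
    counts = [(4, 0, 0), (1, 0, 3), (0, 4, 1), (3, 1, 0), (2, 2, 1)]
    exact = predict(counts, [-1.0, 0.5, 0.25])
    weights = fit_least_squares(counts, exact)
    print(weights)
    print(pearson(exact, predict(counts, weights)))
```

In the setting of the slides, the fit would use only the folding sequences, while the correlation would then be evaluated over all sequences.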

People
- Sausage
  - Andrew Torda (RSC)
  - Dan Ayers (RSC)
  - Zsuzsa Dosztanyi (RSC)
  - Anthony Russell (RSC)
- GlnB/GlnK
  - Subhash Vasudevan (JCU)
  - David Ollis (RSC)
- At ANUSF
  - Alistair Rendell

Want to try yourself?
- Sausage and Cassandra freely available
- [email protected]
