AUTO-MUTE

AUTOmated server for predicting functional consequences of amino acid MUTations in protEins


Structural Bioinformatics at
George Mason University

Questions or Comments?
mmasso@gmu.edu

Stability Changes (ΔΔGH2O) Documentation

Introduction

Stability Changes (ΔΔGH2O) is an automated server for predicting how single amino acid replacements change protein stability under denaturant-induced denaturation. The available models were trained with a modified version of a mutant protein dataset that was obtained by searching the ProTherm database, previously reported in Saraboji et al. (2006) (average assignment method), and also used by Huang et al. (2007) (CART method). The original dataset consisted of 2204 single amino acid substitutions in 88 proteins with solved structures in the Protein Data Bank (PDB). After removing mutants associated with structures containing gaps, as well as mutants at positions with fewer than six nearest neighbors in tessellated protein structures, the modified dataset used here consists of 1962 single point mutants in 77 protein structures.

Depending on the needs of the investigator, two supervised classification models (for predicting only the sign of ΔΔGH2O) and two regression models (for predicting the actual value of ΔΔGH2O) are available. The classification models are Random Forest (RF) and Support Vector Machine (SVM); the regression models are Tree regression (REPTree) and SVM regression (SVMreg). The choice between the two classification (or regression) models rests with the algorithmic preference of the researcher. Although the two models of each type can be ranked by the performance measures detailed in the Results section below, these measures are similar in magnitude and do not necessarily indicate how accurately the models will predict an independent test set of single point mutants that have yet to be experimentally investigated. Additionally, because the models were developed using implementations of four different machine learning algorithms, predictions for a specific mutant will on occasion be inconsistent among the models, especially where the sign of ΔΔGH2O is predicted with low confidence.

Lastly, predictions can only be performed for a mutant if it represents a single amino acid substitution in a tessellatable single chain of a solved protein structure with a coordinate file available in the PDB. Specifically, Delaunay tessellation of the structure can be performed if the PDB file contains consecutive primary sequence numbering in the ATOM lines (i.e., no gaps in the structure) starting with a non-negative integer, the alpha-carbon atomic coordinates are available for all the constituent amino acids, and no alternative conformations exist for the alpha-carbon atoms. In addition to X-ray structures, NMR structure files are potentially tessellatable if they consist of a single minimized average structure as opposed to multiple models.
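The tessellatability criteria above can be illustrated with a minimal Python sketch. This is not the server's implementation: the function name `check_tessellatable` is hypothetical, the column positions follow the standard fixed-width PDB ATOM record, the mapping of "@" to a blank chain identifier is an assumption based on the input convention described on this page, and insertion codes are ignored for simplicity.

```python
def check_tessellatable(pdb_lines, chain_id):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper that mirrors the criteria listed above; it is not
    the server's actual code.
    """
    pdb_lines = list(pdb_lines)
    chain = " " if chain_id == "@" else chain_id  # assumption: "@" means blank chain ID
    problems = []

    # More than one MODEL record suggests an NMR ensemble, not a single structure.
    if sum(1 for line in pdb_lines if line.startswith("MODEL")) > 1:
        problems.append("file contains multiple models")

    residue_numbers = []      # residue numbers in order of first appearance
    residues_with_ca = set()  # residues for which an alpha-carbon was seen
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # inspect the first model only
        if not line.startswith("ATOM") or len(line) < 27 or line[21] != chain:
            continue
        atom_name = line[12:16].strip()  # columns 13-16
        alt_loc = line[16]               # column 17
        res_seq = int(line[22:26])       # columns 23-26
        if not residue_numbers or residue_numbers[-1] != res_seq:
            residue_numbers.append(res_seq)
        if atom_name == "CA":
            residues_with_ca.add(res_seq)
            if alt_loc != " ":
                problems.append(f"alternate conformation for CA of residue {res_seq}")

    if not residue_numbers:
        problems.append("no ATOM records for the requested chain")
        return problems
    if residue_numbers[0] < 0:
        problems.append("numbering starts with a negative integer")
    for previous, current in zip(residue_numbers, residue_numbers[1:]):
        if current != previous + 1:
            problems.append(f"numbering gap between {previous} and {current}")
    for number in residue_numbers:
        if number not in residues_with_ca:
            problems.append(f"missing CA atom for residue {number}")
    return problems
```

Calling `check_tessellatable(open("structure.pdb"), "A")` on a local coordinate file (the file name is a placeholder) returns an empty list when none of these problems is detected.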

Methods

Among the numerous factors influencing model performance are the dataset size and composition utilized for training, the type of attributes (i.e., predictors) used as components for the feature vectors characterizing the mutants in the dataset, and the machine learning algorithm chosen for model building. AUTO-MUTE utilizes attributes that include EC score at the mutated position (mutant protein residual score), ordered EC scores of the six nearest neighbors to the mutated position, native and replacement amino acid identities at the mutated position, ordered amino acid identities at the six nearest neighbors, and ordered differences between the primary sequence positions of the nearest neighbors and the mutated residue (see AUTO-MUTE home page for details).


The following attributes were also initially included for each single point mutant: relative solvent accessibility (RSA), as well as temperature and pH of the experimental conditions under which ΔΔGH2O measurements were obtained. However, the models provided on this server for making predictions were trained by replacing RSA with the following Delaunay tessellation-derived attributes: mean volume and tetrahedrality of the simplices in which the mutated position serves as a vertex, location (surface, undersurface, or buried) of the mutated position, number of edge contacts that the mutated position has with surface positions, and secondary structure (helix, strand, coil, or turn) of the mutated position (see AUTO-MUTE home page for details). As described in the Results section below, there is a negligible difference in model performance as a result of such an alteration in the training set mutant feature vectors.

Required Inputs and Server Outputs

A valid PDB accession code and a specific chain (use @ if null) are required for the structure of the protein containing the single residue substitution whose impact on stability (ΔΔGH2O sign or value) is to be predicted. The mutation under consideration must be supplied in the form (native residue)(position number from PDB file ATOM lines)(replacement residue), for example D25E; however, using an underscore "_" in place of the replacement residue, for example D25_, yields predictions for all 19 amino acid substitutions at the requested position. The final inputs are the temperature (°C, 0-100) and pH (0-14) conditions under which predictions are to be obtained.
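As an illustration of the input format described above, the following Python sketch parses a mutation string, expands the underscore form into the 19 possible substitutions, and checks the temperature and pH ranges. The function names and error messages are hypothetical, and the server's own validation may differ.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
_MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z_])$")


def expand_mutation(mutation):
    """Return the list of single substitutions described by a string such as D25E or D25_."""
    match = _MUTATION_RE.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"not of the form (native)(position)(replacement): {mutation!r}")
    native, position, replacement = match.group(1), int(match.group(2)), match.group(3)
    if native not in AMINO_ACIDS:
        raise ValueError(f"unknown native residue: {native}")
    if replacement == "_":
        # Underscore: every residue other than the native one (19 substitutions).
        return [f"{native}{position}{aa}" for aa in AMINO_ACIDS if aa != native]
    if replacement not in AMINO_ACIDS or replacement == native:
        raise ValueError(f"invalid replacement residue: {replacement}")
    return [f"{native}{position}{replacement}"]


def validate_conditions(temperature_c, ph):
    """Check the ranges stated above: temperature 0-100 °C and pH 0-14."""
    if not 0 <= temperature_c <= 100:
        raise ValueError("temperature must be between 0 and 100 °C")
    if not 0 <= ph <= 14:
        raise ValueError("pH must be between 0 and 14")
    return temperature_c, ph
```

For example, `expand_mutation("D25E")` yields a single entry, while `expand_mutation("D25_")` yields one entry for each of the 19 residues other than D.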

In addition to reproducing the inputs, the output includes either the predicted sign of ΔΔGH2O along with a confidence level (classification) or the predicted value of ΔΔGH2O (regression), together with the following properties of the mutated position: mean volume and tetrahedrality, location, number of edge contacts with surface positions, and secondary structure.

Results

In order to directly compare our results with those of the classification and regression tree (CART) approach of Huang et al. (2007), performance of the algorithms is evaluated by applying a 5-fold cross-validation procedure and calculating the following values. For supervised classification, each mutant belongs to either the “increased stability” or “+” class if experimental ΔΔGH2O ≥ 0, or the “decreased stability” or “–” class if ΔΔGH2O < 0. With the understanding that TP (TN) = total number of correctly predicted “increased stability” (“decreased stability”) mutants, and FN (FP) = total number of respectively misclassified mutants, the overall accuracy is defined as
 
Q = (TP + TN) / (TP + TN + FP + FN),

and the Matthews correlation coefficient is given by

MCC = (TP × TN – FP × FN) / [(TP + FN)(TP + FP)(TN + FN)(TN + FP)]^(1/2).

| Method | Q | MCC |
|---|---|---|
| RF (server attributes) | 0.81 | 0.40 |
| SVM (server attributes) | 0.80 | 0.33 |
| RF (RSA attribute) | 0.81 | 0.38 |
| Huang et al. (2007) (CART) | 0.80 | 0.44 |
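The class assignment and the two classification measures defined above can be sketched in Python as follows. The function names are hypothetical, the ASCII labels "+" and "-" stand for the two classes, the example values are invented for illustration and are not taken from the dataset, and returning 0 when the MCC denominator vanishes is a common convention rather than something stated in the source.

```python
import math


def confusion_counts(experimental_ddg, predicted_labels):
    """Tally (TP, TN, FP, FN), treating experimental ΔΔG >= 0 as the "+" class."""
    tp = tn = fp = fn = 0
    for ddg, label in zip(experimental_ddg, predicted_labels):
        actual_positive = ddg >= 0
        predicted_positive = label == "+"
        if actual_positive and predicted_positive:
            tp += 1
        elif not actual_positive and not predicted_positive:
            tn += 1
        elif predicted_positive:
            fp += 1  # "-" mutant predicted as "+"
        else:
            fn += 1  # "+" mutant predicted as "-"
    return tp, tn, fp, fn


def overall_accuracy(tp, tn, fp, fn):
    """Q = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """MCC as defined above; 0.0 is returned if the denominator is zero (assumed convention)."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Invented example values, for illustration only.
experimental = [1.2, -0.5, 0.0, -2.1, -0.3, 0.7]
predicted = ["+", "-", "-", "-", "+", "+"]
counts = confusion_counts(experimental, predicted)
q = overall_accuracy(*counts)
mcc = matthews_cc(*counts)
```

In this invented example the counts are TP = 2, TN = 2, FP = 1, and FN = 1, so Q = 4/6 and MCC = 3/9.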

For regression, model performance is evaluated by calculating the mean absolute error (MAE) between the predicted and experimental ΔΔGH2O values.

| Method | MAE |
|---|---|
| REPTree (server attributes) | 1.06 |
| SVMreg (server attributes) | 1.00 |
| REPTree (RSA attribute) | 1.06 |
| Huang et al. (2007) (CART) | 1.37 |
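For reference, the mean absolute error used to evaluate the regression models can be computed as in the short Python sketch below. The function name is hypothetical and the numbers are invented for illustration; they are not server predictions or ProTherm measurements.

```python
def mean_absolute_error(predicted, experimental):
    """Average of |predicted - experimental| over paired ΔΔG values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


# Invented values: absolute errors are 0.5, 1.0, and 0.0, so the MAE is 0.5.
example_mae = mean_absolute_error([0.5, -1.0, 2.0], [1.0, -2.0, 2.0])
```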

Next, Saraboji et al. (2006) reported an overall accuracy of 0.80, based on leave-one-out cross-validation (jackknife) applied in conjunction with their average assignment method. Similarly, application of the jackknife, in conjunction with RF learning and the use of the RSA attribute in the feature vectors representing the protein mutants in our modified dataset, resulted in an overall accuracy of 0.83.
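The 5-fold cross-validation and the jackknife differ only in the number of folds, which the following Python sketch makes explicit. It is an illustration under stated assumptions: the source does not say how mutants were assigned to folds, so the seeded random shuffle and the function name `k_fold_indices` are hypothetical choices.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Setting k equal to n_items gives leave-one-out (jackknife) validation.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random, unstratified folds
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Fold sizes for a dataset of 1962 mutants split five ways.
five_fold_sizes = sorted(len(test) for _, test in k_fold_indices(1962, 5))
```

With 1962 mutants, each of the five test folds holds 392 or 393 mutants, and every mutant is held out exactly once.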

Finally, an independent test set of 112 mutants (14 "+" and 98 "–") in 34 protein structures was collected from the ProTherm database. None of the test set mutants appear in the training dataset, and 18 of the protein structures are unique to the test set. A validation study was performed, whereby the test set mutants were each blindly predicted by the server's classification models, with the results tabulated below.

| Method | Q | S(+) | P(+) | S(–) | P(–) | BER | MCC |
|---|---|---|---|---|---|---|---|
| RF (server attributes) | 0.91 | 0.71 | 0.63 | 0.94 | 0.96 | 0.17 | 0.62 |
| SVM (server attributes) | 0.88 | 0.71 | 0.53 | 0.91 | 0.96 | 0.19 | 0.55 |
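The column headings S, P, and BER are not defined in the text. The Python sketch below assumes the usual readings: S is the per-class sensitivity, P is the per-class precision, and BER is the balanced error rate (the mean of the two per-class error rates). The confusion counts used in the example are not reported in the source; they were back-calculated here from the 14 "+" and 98 "–" test mutants so that they agree with the RF row to within rounding, and should be treated as an assumption.

```python
def per_class_metrics(tp, tn, fp, fn):
    """Sensitivity and precision for each class, plus the balanced error rate."""
    return {
        "S(+)": tp / (tp + fn),  # fraction of "+" mutants predicted as "+"
        "P(+)": tp / (tp + fp),  # fraction of "+" predictions that are correct
        "S(-)": tn / (tn + fp),  # fraction of "-" mutants predicted as "-"
        "P(-)": tn / (tn + fn),  # fraction of "-" predictions that are correct
        "BER": 0.5 * (fn / (tp + fn) + fp / (tn + fp)),
    }


# Back-calculated (assumed) counts for the RF model on the 112 test mutants.
rf_metrics = per_class_metrics(tp=10, tn=92, fp=6, fn=4)
```

Under these assumptions the computed values match the tabulated RF results to two decimal places, which supports the assumed definitions but does not confirm the underlying counts.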

A majority of the independent test set mutants (67 mutants: 9 "+" and 58 "–") are associated with the 18 protein structures that are unique to the test set. For this particular subset, prediction results are Q = 0.88, BER = 0.26, and MCC = 0.49 using the RF model (8/67 mutants incorrectly predicted), and Q = 0.87, BER = 0.27, and MCC = 0.45 using the SVM model (9/67 mutants incorrectly predicted).

References
  1. Bava K.A., Gromiha M.M., Uedaira H., Kitajima K. & Sarai A. (2004) ProTherm, version 4.0: thermodynamic database for proteins and mutants, Nucleic Acids Res. 32, D120-D121.
  2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. & Bourne P.E. (2000) The Protein Data Bank, Nucleic Acids Res. 28, 235-242.
  3. Huang L.T., Saraboji K., Ho S.Y., Hwang S.F., Ponnuswamy M.N. & Gromiha M.M. (2007) Prediction of protein mutant stability using classification and regression tool, Biophys. Chem. 125, 462-470.
  4. Masso M. & Vaisman I.I. (2010) AUTO-MUTE: web-based tools for predicting stability changes in proteins due to single amino acid replacements, Protein Eng. Des. Sel. 23, 683-687.
  5. Masso M. & Vaisman I.I. (2008) Accurate prediction of stability changes in protein mutants combining machine learning with structure based computational mutagenesis, Bioinformatics 24, 2002-2009.
  6. Masso M. & Vaisman I.I. (2007) Accurate prediction of enzyme mutant activity based on a multibody statistical potential, Bioinformatics 23, 3155-3161.
  7. Masso M., Lu Z. & Vaisman I.I. (2006) Computational mutagenesis studies of protein structure-function correlations, Proteins 64, 234-245.
  8. Masso M. & Vaisman I.I. (2003) Comprehensive mutagenesis of HIV-1 protease: a computational geometry approach, Biochem. Biophys. Res. Comm. 305, 322-326.
  9. Saraboji K., Gromiha M.M. & Ponnuswamy M.N. (2006) Average assignment method for predicting the stability of protein mutants, Biopolymers 82, 80-92.