AUTO-MUTE: AUTOmated server for predicting functional consequences of amino acid MUTations in protEins


Structural Bioinformatics at
George Mason University

Questions or Comments?
mmasso@gmu.edu

Stability Changes (ΔΔG) Documentation

Introduction

Stability Changes (ΔΔG) is an automated server for predicting the impact of single amino acid replacements on protein stability due to thermal denaturation. The available models were trained with a slightly modified version of a diverse mutant protein dataset, previously reported in Capriotti et al. (2005) and obtained by searching the ProTherm database. The original dataset consisted of 1948 single amino acid substitutions in 58 proteins with solved structures in the Protein Data Bank (PDB), and structures were chosen so that they were uniformly distributed among the four major SCOP structural classifications. After removing mutants associated with two protein structures containing gaps, as well as mutants at positions with fewer than six nearest neighbors in tessellated protein structures, our modified dataset consists of 1925 single point mutants in 55 protein structures.

Depending upon the needs of the investigator, two supervised classification models (for predicting only the sign of ΔΔG) and two regression models (for predicting the actual value of ΔΔG) are available. The supervised classification models are Random Forest (RF) and Support Vector Machine (SVM), while the regression models are Tree regression (REPTree) and SVM regression (SVMreg). The choice between the two classification (or regression) models rests with the algorithmic preference of the researcher. Although the two models of each type can be ranked based on various performance measures, which are detailed in the Results section below, these measures are similar in magnitude and are not necessarily indicative of the predictive accuracy of the models on an independent test set of single point mutants that have yet to be experimentally investigated. Additionally, since the models were developed using implementations of four different machine learning algorithms, predictions for a specific mutant will likely be inconsistent among the models on occasion, especially where the sign of ΔΔG is predicted with low confidence.

Lastly, predictions can only be performed for a mutant if it represents a single amino acid substitution in a tessellatable single chain of a solved protein structure with a coordinate file available in the PDB. Specifically, Delaunay tessellation of the structure can be performed if the PDB file contains consecutive primary sequence numbering in the ATOM lines (i.e., no gaps in the structure) starting with a non-negative integer, the alpha-carbon atomic coordinates are available for all the constituent amino acids, and no alternative conformations exist for the alpha-carbon atoms. In addition to X-ray structures, NMR structure files are potentially tessellatable if they consist of a single minimized average structure as opposed to multiple models.
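The conditions above can be checked before submitting a structure. The following is a minimal sketch, not the server's own code: the function name check_tessellatable is hypothetical, it reads the fixed-column ATOM records of the standard PDB file format, it ignores insertion codes, and it treats a file with more than one MODEL record as a multi-model NMR file.

```python
def check_tessellatable(pdb_text, chain_id):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: it only mirrors the conditions stated in the text
    and is not taken from the AUTO-MUTE server.
    """
    residues = []     # residue numbers in order of first appearance
    ca_seen = set()   # residues that have an alpha-carbon record
    ca_alt = set()    # residues whose alpha carbon has an alternate location
    models = 0
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record == "MODEL":
            models += 1
            continue
        # Examine only ATOM records of the requested chain in the first model.
        if record != "ATOM" or len(line) < 26 or line[21] != chain_id or models > 1:
            continue
        res_num = int(line[22:26])          # columns 23-26: residue number
        if not residues or residues[-1] != res_num:
            residues.append(res_num)
        if line[12:16].strip() == "CA":     # columns 13-16: atom name
            ca_seen.add(res_num)
            if line[16] != " ":             # column 17: alternate location
                ca_alt.add(res_num)

    problems = []
    if models > 1:
        problems.append(f"{models} models present; a single structure is needed")
    if not residues:
        return problems + [f"no ATOM records found for chain {chain_id!r}"]
    if residues[0] < 0:
        problems.append("numbering starts with a negative integer")
    for prev, cur in zip(residues, residues[1:]):
        if cur != prev + 1:
            problems.append(f"numbering gap between residues {prev} and {cur}")
    for res in residues:
        if res not in ca_seen:
            problems.append(f"residue {res} has no alpha-carbon record")
    for res in sorted(ca_alt):
        problems.append(f"residue {res} has alternate alpha-carbon conformations")
    return problems
```

Passing the text of a coordinate file and a chain identifier returns the list of violated conditions; a chain without an identifier would be passed as a single space, which is how the PDB format stores it.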

Methods

Among the numerous factors influencing model performance are the dataset size and composition utilized for training, the type of attributes (i.e., predictors) used as components for the feature vectors characterizing the mutants in the dataset, and the machine learning algorithm chosen for model building. AUTO-MUTE utilizes attributes that include EC score at the mutated position (mutant protein residual score), ordered EC scores of the six nearest neighbors to the mutated position, native and replacement amino acid identities at the mutated position, ordered amino acid identities at the six nearest neighbors, and ordered differences between the primary sequence positions of the nearest neighbors and the mutated residue (see AUTO-MUTE home page for details).

In order for direct comparisons to be made with results obtained by Capriotti et al. (2005), we initially included the following attributes for each single point mutant: relative solvent accessibility (RSA), as well as temperature and pH of the experimental conditions under which ΔΔG measurements were obtained. However, the models provided on this server for making predictions were trained by replacing RSA with the following Delaunay tessellation-derived attributes: mean volume and tetrahedrality of the simplices in which the mutated position serves as a vertex, location (surface, undersurface, or buried) of the mutated position, number of edge contacts that the mutated position has with surface positions, and secondary structure (helix, strand, coil, or turn) of the mutated position (see AUTO-MUTE home page for details). As described in the Results section below, there is a negligible difference in model performance as a result of such an alteration in the training set mutant feature vectors.

Required Inputs and Server Outputs

A valid PDB accession code and a specific chain (use @ if null) are required for the structure of the protein containing a single residue substitution whose impact on stability (ΔΔG sign or value) is to be predicted. The mutation under consideration must be supplied in the form (native residue)(position number from PDB file ATOM lines)(replacement residue), for example D25E. Using an underscore "_" in place of the replacement residue, as in D25_, returns predictions for all 19 amino acid substitutions at the requested position. The final inputs are the temperature (°C, 0-100) and pH (0-14) conditions under which predictions are to be obtained.
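The input format described above can be illustrated with a small parser. This is a sketch under stated assumptions rather than the server's validation code: the names parse_mutation and validate_conditions are hypothetical, residues are assumed to be given as one-letter codes for the 20 standard amino acids, and a replacement identical to the native residue is rejected.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard one-letter codes
_MUTATION = re.compile(r"([A-Z])(\d+)([A-Z_])")

def parse_mutation(text):
    """Parse a string such as 'D25E' or 'D25_' (hypothetical helper).

    Returns (native residue, position, list of replacement residues).
    """
    match = _MUTATION.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError(f"not of the form D25E or D25_: {text!r}")
    native, position, replacement = match.group(1), int(match.group(2)), match.group(3)
    if native not in AMINO_ACIDS:
        raise ValueError(f"unknown native residue: {native}")
    if replacement == "_":
        # Underscore requests all 19 substitutions at this position.
        replacements = [aa for aa in AMINO_ACIDS if aa != native]
    elif replacement in AMINO_ACIDS and replacement != native:
        replacements = [replacement]
    else:
        raise ValueError(f"invalid replacement residue: {replacement}")
    return native, position, replacements

def validate_conditions(temperature_c, ph):
    """Check the temperature (0-100 °C) and pH (0-14) ranges given in the text."""
    if not 0 <= temperature_c <= 100:
        raise ValueError("temperature must be between 0 and 100 °C")
    if not 0 <= ph <= 14:
        raise ValueError("pH must be between 0 and 14")
    return temperature_c, ph
```

Here parse_mutation("D25E") yields a single replacement, while parse_mutation("D25_") yields the 19 residues other than D.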

In addition to reproducing the inputs, the output includes either the predicted sign of ΔΔG along with a confidence level (classification) or the predicted value of ΔΔG (regression), the mean volume and tetrahedrality for the mutated position, the location of the mutated position and the number of edge contacts it has with surface positions, and the secondary structure of the mutated position.

Results

Based on the application of a 20-fold cross-validation procedure, performance of the algorithms is evaluated by calculating the following values. In the case of supervised classification, each mutant belongs to either the “increased stability” or “+” class if experimental ΔΔG ≥ 0, or the “decreased stability” or “–” class if ΔΔG < 0. With the understanding that TP (TN) = total number of correctly predicted “increased stability” (“decreased stability”) mutants, and FN (FP) = total number of respectively misclassified mutants, the overall accuracy is defined as
 
Q = (TP + TN) / (TP + TN + FP + FN).

Also, for the “increased stability” class,
 
S(+) = sensitivity = TP / (TP + FN) and P(+) = precision = TP / (TP + FP),

while for the “decreased stability” class,
 
S(–) = sensitivity = TN / (TN + FP) and P(–) = precision = TN / (TN + FN).

Finally, the following two measures are calculated due to their robustness with respect to unequal class distributions: balanced error rate is defined as
 
BER = 0.5 × [FN / (FN + TP) + FP / (FP + TN)],

and the Matthews correlation coefficient is given by

MCC = (TP × TN – FP × FN) / [(TP + FN)(TP + FP)(TN + FN)(TN + FP)]^(1/2).
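The measures defined above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name classification_metrics is hypothetical, the counts in the usage lines are made up rather than taken from the server's results, and no guard is included for a class with zero members, which would cause a division by zero.

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """Compute Q, S(+), P(+), S(-), P(-), BER and MCC from confusion counts."""
    return {
        "Q": (tp + tn) / (tp + tn + fp + fn),
        "S(+)": tp / (tp + fn),
        "P(+)": tp / (tp + fp),
        "S(-)": tn / (tn + fp),
        "P(-)": tn / (tn + fn),
        "BER": 0.5 * (fn / (fn + tp) + fp / (fp + tn)),
        "MCC": (tp * tn - fp * fn)
               / sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp)),
    }

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

BER equals one minus the mean of S(+) and S(–), which is why it is less sensitive than Q to a dataset dominated by one class.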

| Method | Q | S(+) | P(+) | S(–) | P(–) | BER | MCC |
|---|---|---|---|---|---|---|---|
| RF (server attributes) | 0.86 | 0.69 | 0.81 | 0.93 | 0.88 | 0.19 | 0.65 |
| SVM (server attributes) | 0.83 | 0.69 | 0.74 | 0.90 | 0.87 | 0.21 | 0.60 |
| RF (RSA attribute) | 0.86 | 0.70 | 0.81 | 0.93 | 0.88 | 0.18 | 0.66 |
| SVM (RSA attribute) | 0.84 | 0.70 | 0.75 | 0.90 | 0.87 | 0.20 | 0.61 |
| Capriotti et al. (2005) (SVM, RSA attribute) | 0.80 | 0.56 | 0.73 | 0.91 | 0.83 | 0.28 | 0.51 |

In the case of regression, model performance is evaluated by calculating the Pearson correlation coefficient (r) of the predicted and experimental ΔΔG values, the equation of the regression line, and the standard error.
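As a rough illustration of these three quantities, the sketch below computes them with the standard library only. Several choices are assumptions, since the text does not specify them: the function name regression_summary is hypothetical, the regression line is fitted with experimental ΔΔG as x and predicted ΔΔG as y, and the "standard error" is taken to be the root-mean-square difference between predicted and experimental values.

```python
from math import sqrt

def regression_summary(experimental, predicted):
    """Return (r, slope, intercept, standard_error) for paired ΔΔG values.

    Assumptions (not stated in the source): x = experimental, y = predicted,
    and standard error = RMS difference between predicted and experimental.
    """
    n = len(experimental)
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, predicted))
    r = sxy / sqrt(sxx * syy)               # Pearson correlation coefficient
    slope = sxy / sxx                        # least-squares line y = slope*x + intercept
    intercept = mean_y - slope * mean_x
    std_err = sqrt(sum((y - x) ** 2
                       for x, y in zip(experimental, predicted)) / n)
    return r, slope, intercept, std_err

# Made-up values in kcal/mol, for illustration only.
r, slope, intercept, std_err = regression_summary(
    [-2.0, -1.0, 0.0, 1.0], [-1.0, -0.5, 0.0, 0.5])
print(f"r = {r:.2f}, y = {slope:.4f}x + {intercept:.4f}, SE = {std_err:.2f} kcal/mol")
```

A slope below 1 with a high r, as in the made-up example, indicates predictions that track the experimental values but are compressed toward zero.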

| Method | r | Standard Error | Regression Line |
|---|---|---|---|
| REPTree (server attributes) | 0.79 | 1.1 kcal/mol | --- |
| SVMreg (server attributes) | 0.76 | 1.2 kcal/mol | --- |
| REPTree (RSA attribute) | 0.79 | 1.1 kcal/mol | y = 0.5357x – 0.4376 |
| SVMreg (RSA attribute) | 0.76 | 1.2 kcal/mol | y = 0.6287x – 0.3124 |
| Capriotti et al. (2005) (SVMreg, RSA attribute) | 0.71 | 1.3 kcal/mol | y = 0.5223x – 0.4705 |

Finally, an independent test set of 142 mutants (20 "+" and 122 "–") in 18 protein structures was collected from the ProTherm database. None of the test set mutants appear in the training dataset, and 14 of the protein structures are unique to the test set. A validation study was performed, whereby the test set mutants were each blindly predicted by the server's classification models, with the results tabulated below.

| Method | Q | S(+) | P(+) | S(–) | P(–) | BER | MCC |
|---|---|---|---|---|---|---|---|
| RF (server attributes) | 0.94 | 0.70 | 0.88 | 0.98 | 0.95 | 0.16 | 0.75 |
| SVM (server attributes) | 0.87 | 0.70 | 0.52 | 0.89 | 0.95 | 0.20 | 0.53 |

A majority of the independent test set mutants (134 mutants: 12 "+" and 122 "–") are associated with the 14 protein structures that are unique to the test set. For this particular subset, prediction results are Q = 0.94, BER = 0.26, and MCC = 0.59 using the RF model (8/134 mutants incorrectly predicted), and Q = 0.87, BER = 0.26, and MCC = 0.38 using the SVM model (18/134 mutants incorrectly predicted).

References
  1. Bava K.A., Gromiha M.M., Uedaira H., Kitajima K. & Sarai A. (2004) ProTherm, version 4.0: thermodynamic database for proteins and mutants, Nucleic Acids Res. 32, D120-D121.
  2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. & Bourne P.E. (2000) The Protein Data Bank, Nucleic Acids Res. 28, 235-242.
  3. Capriotti E., Fariselli P. & Casadio R. (2005) I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure, Nucleic Acids Res. 33, W306-W310.
  4. Masso M. & Vaisman I.I. (2010) AUTO-MUTE: web-based tools for predicting stability changes in proteins due to single amino acid replacements, Protein Eng. Des. Sel. 23, 683-687.
  5. Masso M. & Vaisman I.I. (2008) Accurate prediction of stability changes in protein mutants combining machine learning with structure based computational mutagenesis, Bioinformatics 24, 2002-2009.
  6. Masso M. & Vaisman I.I. (2007) Accurate prediction of enzyme mutant activity based on a multibody statistical potential, Bioinformatics 23, 3155-3161.
  7. Masso M., Lu Z. & Vaisman I.I. (2006) Computational mutagenesis studies of protein structure-function correlations, Proteins 64, 234-245.
  8. Masso M. & Vaisman I.I. (2003) Comprehensive mutagenesis of HIV-1 protease: a computational geometry approach, Biochem. Biophys. Res. Comm. 305, 322-326.
  9. Murzin A.G., Brenner S.E., Hubbard T. & Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol. 247, 536-540.