Variant Effect Predictors
Computational approaches can leverage variant effect data for functional interpretation
Guidelines for Releasing a Variant Effect Predictor are now available on arXiv. See our extensive list of variant effect predictors, including classifications, links and references.
Are we missing something?
If there is a variant effect predictor you think should be added to the list, please tell us via our Google Form.
What is a variant effect predictor?
A variant effect predictor (VEP) is a software tool that predicts the fitness effects of genetic variants. Compared to MAVE-style assays, VEPs are largely free, quick and easy to access, although assessing the quality of their predictions remains challenging. While computational predictors can only provide supporting evidence towards variant classification under the 2015 ACMG/AMP guidelines [1], recent work has provided recommendations for a framework that assigns stronger evidence strengths to well-calibrated VEPs [2], [3].
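To illustrate how such a calibrated framework operates, the sketch below maps a raw predictor score onto an ACMG/AMP evidence strength. It is a minimal sketch using the score intervals reported for REVEL in [2] as an example; every VEP has its own calibrated thresholds, so verify these values against the publication before any real use.

```python
# Minimal sketch: mapping a calibrated VEP score to an ACMG/AMP evidence
# strength, in the spirit of [2]. The thresholds are the REVEL intervals
# reported in that paper (verify before use); other VEPs have different
# calibrated thresholds.
def revel_evidence(score: float) -> str:
    if score >= 0.932:
        return "PP3_Strong"
    if score >= 0.773:
        return "PP3_Moderate"
    if score >= 0.644:
        return "PP3_Supporting"
    if score <= 0.003:
        return "BP4_VeryStrong"
    if score <= 0.016:
        return "BP4_Strong"
    if score <= 0.183:
        return "BP4_Moderate"
    if score <= 0.290:
        return "BP4_Supporting"
    return "Indeterminate"  # scores between the benign and pathogenic intervals

print(revel_evidence(0.85))  # -> PP3_Moderate
```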
What types of VEP are available?
This resource focuses primarily on VEPs that make predictions in protein-coding regions of the genome. There are currently over 100 such predictors available, with more being constantly developed and improved upon as new techniques and technologies become available. The past five years in particular have seen huge growth in the area of computational variant effect prediction.
With so many different methods to choose from, it can be difficult to select one for variant classification. Classic predictors like SIFT and PolyPhen-2 are often chosen for their simplicity or integration into existing pipelines, but considerably better options are available for most purposes. As a first step, we recommend looking at an independent benchmark of VEP performance (i.e. not a benchmark performed by a VEP author that includes their own method) or at the ProteinGym resource (https://proteingym.org/) [4].
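For intuition, the snippet below shows the core calculation behind a functional benchmark of this kind: rank correlation between a predictor's scores and experimental MAVE scores for the same variants. The file and column names are hypothetical placeholders.

```python
# Minimal sketch of a functional benchmark in the spirit of ProteinGym:
# compare predictor scores to deep mutational scanning (MAVE) scores using
# Spearman correlation. File and column names are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("benchmark_variants.csv")  # columns: variant, mave_score, vep_score
rho, pval = spearmanr(df["vep_score"], df["mave_score"])
print(f"Spearman rho = {rho:.2f} (p = {pval:.1e})")
```

A rank-based correlation is used here because VEP scores and assay scores are on entirely different scales.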
Another important aspect to consider is how the VEP is trained. It has been shown that VEPs that draw their training data from clinical databases (such as ClinVar and HGMD) perform better on similar datasets, but are often less effective at predicting the outcome of functional studies [5], [6]. This data circularity can skew the results of benchmarks based on clinical data unless extreme care is taken in variant curation and VEP selection. The recent increase in benchmarks based on functional and gene-trait data aims to avoid this source of bias [7], [8]. Based on their training data and procedures, we can separate VEPs into three categories that indicate how vulnerable each is to apparent performance inflation stemming from data circularity.
- Clinically trained VEPs are trained using labelled clinical data from large databases such as ClinVar, and/or include such VEPs as features. These VEPs are the most vulnerable to data circularity and often perform considerably better on clinical datasets than functional ones.
- Population tuned VEPs are not directly trained with clinical data, but may be exposed to it during tuning or optimisation procedures. These methods are moderately vulnerable to data circularity.
- Population free VEPs are not trained with any labelled clinical data and do not include any features derived from allele frequency. These VEPs should be largely immune to data circularity and perform consistently across both clinical and functional benchmarks.
We recommend using several top-performing VEPs with differing methodologies to generate a consensus prediction of variant effect, as sketched below. However, to make use of the enhanced VEP evidence strengths in [2], only a single VEP should be chosen, ideally based on the range of evidence strengths it can attain.
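One simple way to combine predictors is to rank-normalise each VEP's scores, so that differences in scale cancel out, and then average. This is a minimal sketch; the column names are hypothetical, and you should check the orientation of each score, since some VEPs treat higher scores as more damaging and others the reverse.

```python
# Minimal sketch: an unweighted consensus across several VEPs via percentile
# ranks. Predictor column names are hypothetical placeholders; make sure all
# scores point the same way (higher = more damaging) before averaging.
import pandas as pd

df = pd.read_csv("multi_vep_scores.csv")   # one row per variant
predictors = ["vep_a", "vep_b", "vep_c"]   # placeholder predictor columns

ranks = df[predictors].rank(pct=True)      # percentile rank per predictor
df["consensus"] = ranks.mean(axis=1)       # simple unweighted consensus
print(df.sort_values("consensus", ascending=False).head())
```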
Obtaining VEP predictions
Our resource provides, where possible, three channels through which predictions can be obtained from each VEP.
Many VEPs offer an online interface that allows the user to specify a variant, protein or even upload full VCF files. There is little commonality in the required format between predictors. If the interface simply queries a database of pre-calculated results, then predictions are returned quickly. If the online form runs the predictor itself, then results can take considerably longer, potentially several hours. Some predictors also have a web API, allowing them to be queried programmatically.
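As a concrete example of programmatic access, the sketch below queries the Ensembl REST API, one of the better-known VEP web APIs, for a single variant in HGVS notation. The endpoint and response fields are as documented at https://rest.ensembl.org; check the current documentation before relying on them.

```python
# Minimal sketch: querying a VEP web API (here, Ensembl's REST API) for a
# single variant in HGVS notation. Endpoint details per
# https://rest.ensembl.org; verify against the current documentation.
import requests

SERVER = "https://rest.ensembl.org"
hgvs = "ENST00000366667:c.803C>T"  # example HGVS notation

resp = requests.get(
    f"{SERVER}/vep/human/hgvs/{hgvs}",
    headers={"Content-Type": "application/json"},
    timeout=60,
)
resp.raise_for_status()
for record in resp.json():
    print(record.get("most_severe_consequence"))
```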
The majority of VEPs offer a pre-calculated download of all human coding variants, either through their website or an external repository. While formats for these downloads vary, they are often extremely large CSV files indexed either by protein ID (UniProt/RefSeq) or by genomic coordinates. Note that many older VEPs still use GRCh37 (hg19) coordinates rather than the newer GRCh38.
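Because these files can run to many gigabytes, it is usually better to stream them than to load them whole. The sketch below filters a pre-calculated download for a single protein in chunks; the file name and column names are hypothetical, so check the predictor's documentation for its actual format.

```python
# Minimal sketch: stream a large pre-calculated download in chunks rather
# than loading it all into memory. File and column names are hypothetical;
# check the VEP's documentation for its actual format.
import pandas as pd

hits = []
for chunk in pd.read_csv("predictions_all_variants.csv.gz", chunksize=1_000_000):
    hits.append(chunk[chunk["uniprot_id"] == "P04637"])  # e.g. human p53

scores = pd.concat(hits, ignore_index=True) if hits else pd.DataFrame()
print(scores.head())
```

For downloads indexed by genomic coordinates, a bgzip-compressed, tabix-indexed file allows much faster region queries than scanning a flat CSV.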
Finally, most predictors are also open source (please check the licence if you are planning to modify and redistribute!) and available to install and run locally through GitHub or other sites. Having a local copy of a predictor gives you full control over how it is run; however, the more complex machine-learning-based VEPs can be extremely computationally intensive, often requiring a high-end GPU.
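To give a flavour of what running a modern predictor locally involves, the sketch below scores a single missense variant with the ESM-1v protein language model via the fair-esm package, using the wild-type-marginals heuristic from the ESM-1v paper. The sequence and variant are toy examples, and this is an illustration rather than any specific predictor's official pipeline; a GPU is strongly recommended for real proteins.

```python
# Minimal sketch: scoring a missense variant locally with the ESM-1v protein
# language model (fair-esm package) using the wild-type-marginals heuristic.
# The sequence and variant are toy examples; this is illustrative only.
import torch
import esm

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
wt, pos, mut = "A", 4, "P"                    # hypothetical variant A4P (1-based)
assert sequence[pos - 1] == wt

batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("protein", sequence)])

with torch.no_grad():
    logits = model(tokens)["logits"]
log_probs = torch.log_softmax(logits, dim=-1)

# Token 0 is the beginning-of-sequence marker, so residue i sits at index i.
score = (log_probs[0, pos, alphabet.get_idx(mut)]
         - log_probs[0, pos, alphabet.get_idx(wt)])
print(f"ESM-1v wt-marginal score for {wt}{pos}{mut}: {score.item():.3f}")
```

More negative scores indicate variants the model considers less likely, which tends to correlate with damaging effects.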
Some helpful resources
ProteinGym – A frequently-updated benchmark with both clinical and functional (MAVE) components (https://proteingym.org/).
dbNSFP – A database of pre-calculated predictions from >40 VEPs and conservation metrics for all potential non-synonymous SNVs in the human genome (https://www.dbnsfp.org/home).
VIPdb – A list of >400 VEPs (including all types – not just missense predictors) with links (https://genomeinterpretation.org/vipdb.html).
CAGI – The Critical Assessment of Genome Interpretation (CAGI) community experiment aims to periodically assess and highlight progress in the area of variant effect prediction (https://genomeinterpretation.org/).
References
[1] S. Richards et al., ‘Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology’, Genet Med, vol. 17, no. 5, pp. 405–424, May 2015, doi: 10.1038/gim.2015.30.
[2] V. Pejaver et al., ‘Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria’, Am J Hum Genet, vol. 109, no. 12, pp. 2163–2177, Dec. 2022, doi: 10.1016/j.ajhg.2022.10.013.
[3] S. L. Stenton et al., ‘Assessment of the evidence yield for the calibrated PP3/BP4 computational recommendations’, Genetics in Medicine, vol. 26, no. 11, p. 101213, Nov. 2024, doi: 10.1016/j.gim.2024.101213.
[4] P. Notin et al., ‘ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design’, Advances in Neural Information Processing Systems, vol. 36, pp. 64331–64379, Dec. 2023.
[5] D. G. Grimm et al., ‘The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity’, Human Mutation, vol. 36, no. 5, pp. 513–523, 2015, doi: 10.1002/humu.22768.
[6] K. Mahmood et al., ‘Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics’, Hum Genomics, vol. 11, no. 1, p. 10, May 2017, doi: 10.1186/s40246-017-0104-8.
[7] D. R. Tabet et al., ‘Benchmarking computational variant effect predictors by their ability to infer human traits’, Genome Biology, vol. 25, no. 1, p. 172, Jul. 2024, doi: 10.1186/s13059-024-03314-7.
[8] B. J. Livesey and J. A. Marsh, ‘Variant effect predictor correlation with functional assays is reflective of clinical classification performance’, Dec. 13, 2024, bioRxiv. doi: 10.1101/2024.05.12.593741.
This resource was put together by Benjamin J. Livesey (postdoctoral researcher at the University of Edinburgh) and Joseph Marsh (AMP workstream chair and group leader at the MRC Human Genetics Unit at the University of Edinburgh) as part of the Analysis, Modelling and Prediction (AMP) workstream efforts.
If you are interested in helping to develop and update this resource, please contact project manager Lara Muffley (muffley@uw.edu) or Benjamin J. Livesey (blivesey@exseed.ed.ac.uk).