Introduction

What is PRESCOTT?

PRESCOTT (Population awaRe Epistatic and StruCtural mOdel of muTational effecTs) is a package that predicts mutational effects in a protein from population, evolutionary and structural information. It consists of two main programs: escott and prescott.

ESCOTT calculates the effects of single and multiple point mutations. PRESCOTT then incorporates population frequencies into the ESCOTT predictions, so you need to run ESCOTT first to obtain the predictions of mutational effects. Because of its dependencies, we recommend using the PRESCOTT package via our web site or our docker image.

Input Data Requirements

Input Data Requirements for escott

escott requires two files:

  • a multiple sequence alignment (MSA) file in fasta format (mandatory):

    • your query protein must be the first sequence in the fasta file. In addition, the query sequence should not contain any gaps.

  • a structure file in PDB format (optional but highly recommended).
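The two MSA requirements above can be checked before launching a run. The following is a minimal sketch, not part of the PRESCOTT package: the function names read_fasta and check_escott_msa are hypothetical, and the equal-length check is an assumption based on the general definition of an aligned FASTA file rather than on a documented escott requirement.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples read from a FASTA file."""
    records = []
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_escott_msa(path):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: it treats the first record as the query.
    """
    records = read_fasta(path)
    if not records:
        return ["no sequences found"]
    problems = []
    query = records[0][1]
    # Requirement from this README: the query must not contain gaps.
    if "-" in query or "." in query:
        problems.append("query (first sequence) contains gaps")
    # Assumption: every row of an aligned FASTA has the same length.
    n_bad = sum(1 for _, seq in records[1:] if len(seq) != len(query))
    if n_bad:
        problems.append(f"{n_bad} sequence(s) differ in length from the query")
    return problems
```

The script cannot tell whether the first record really is your query protein, so it is still worth confirming the first header by eye.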

One of the fastest ways to obtain both an input MSA and a PDB file is to run colabfold: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

Please note that the MSA file produced by colabfold (a3m file) can contain gaps in the query sequence. You have to remove them before using the file in PRESCOTT. You can remove the gaps with programs that have a GUI, such as ugene (http://ugene.net/) or jalview (https://www.jalview.org/).
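If you prefer a script to a GUI, one common way to remove gaps from the query is to delete every alignment column in which the query has a gap. The sketch below illustrates that idea under stated assumptions: the alignment has already been converted to an aligned FASTA in which all rows have the same length, the query is the first record, and the helper names drop_query_gap_columns and to_fasta are hypothetical. It is not the procedure used by ugene or jalview.

```python
def drop_query_gap_columns(records, gap_chars="-."):
    """Remove alignment columns where the query (first record) has a gap.

    records: list of (header, aligned_sequence) tuples, query first.
    Returns a new list of (header, sequence) tuples.
    """
    if not records:
        raise ValueError("empty alignment")
    query = records[0][1]
    if any(len(seq) != len(query) for _, seq in records):
        raise ValueError("all sequences must have the same aligned length")
    # Indices of the columns to keep: those where the query has a residue.
    keep = [i for i, char in enumerate(query) if char not in gap_chars]
    return [(header, "".join(seq[i] for i in keep)) for header, seq in records]


def to_fasta(records):
    """Format (header, sequence) tuples as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Dropping columns also discards the residues that other sequences carry at those positions, which is expected here because those positions do not exist in the query.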

For testing purposes, you can find example input files for the BLAT protein in the data/ folder of this repository.

Input Data Requirements for prescott

prescott requires three files:

  • output file of escott (the file ending with …normPredCombi.txt)

  • a fasta file containing only your query sequence

  • a gnomad csv file for your protein, to be downloaded from https://gnomad.broadinstitute.org/.
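Since the query is the first sequence of the escott MSA, the single-sequence fasta file can be derived from that alignment. This is a minimal sketch with a hypothetical function name (extract_query). It assumes that '-' is the only gap character and strips any that remain.

```python
def extract_query(msa_path, out_path):
    """Write the first record of a FASTA alignment to its own FASTA file.

    Returns the query sequence that was written.
    """
    header, chunks = None, []
    with open(msa_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    break  # second record reached: the query is complete
                header = line
            elif header is not None and line:
                chunks.append(line)
    if header is None:
        raise ValueError("no FASTA record found in " + str(msa_path))
    # Assumption: '-' is the gap character; an unaligned file needs none.
    sequence = "".join(chunks).replace("-", "")
    with open(out_path, "w") as out:
        out.write(f"{header}\n{sequence}\n")
    return sequence
```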

Usage

You can find example bash scripts for escott and prescott in the examples folder of this repository.

Below, you will find examples of the most basic usage. Consult the documentation for further details.

Running the escott program

Let’s assume that our input MSA is inputAli.fasta and input.pdb is our structure file in PDB format.

Run the program by issuing the following command in a bash terminal:

escott inputAli.fasta --pdbfile input.pdb

A quick help can be accessed by typing

escott --help

By default, ESCOTT predicts the effect of all possible single mutations at all positions in the query sequence. Alternatively, a file listing a set of single or multiple mutations can be given with the option -m. Each line of the file should contain a mutation (e.g. D136R) or a combination of mutations separated by colons and ordered according to their positions in the sequence (e.g. D136R:V271A).
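A mutation file for the -m option can be generated programmatically. The sketch below is illustrative only: the function names are hypothetical, the mutation pattern (one uppercase letter, a position, one uppercase letter) is an assumption drawn from the D136R example, and the separator is exposed as a parameter so that it can be matched to what your escott version expects.

```python
import re

# Assumed notation: wild-type residue, 1-based position, mutant residue.
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def format_mutation_line(mutations, separator=":"):
    """Join mutations into one line, ordered by sequence position."""
    parsed = []
    for mutation in mutations:
        match = MUTATION.match(mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation: {mutation!r}")
        parsed.append((int(match.group(2)), mutation))
    parsed.sort()  # sort by position, as the file format requires
    return separator.join(mutation for _, mutation in parsed)


def write_mutation_file(path, entries, separator=":"):
    """Write one single or multiple mutation per line.

    entries: iterable of lists, e.g. [["D136R"], ["V271A", "D136R"]].
    """
    with open(path, "w") as out:
        for entry in entries:
            out.write(format_mutation_line(entry, separator) + "\n")
```

The resulting file would then be passed to escott with the -m option described above.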

Running the prescott program

A quick help can be accessed by typing

prescott --help

Run the program by issuing the following command in a bash terminal:

prescott -e ../data/MLH1_normPred_evolCombi.txt -g ../data/gnomAD_v4.0.0_MLH1_HUMAN_ENSG00000076242.csv -s ../data/MLH1.fasta

As far as we know, gnomAD v4.0.0 is the most comprehensive publicly available human population dataset. However, if you would like to use gnomAD v2.1.1, you should specify the version with the '--gnomadversion' parameter as below:

prescott -e ../data/MLH1_normPred_evolCombi.txt -g ../data/gnomAD_v2.1.1_MLH1_HUMAN_ENSG00000076242.csv -s ../data/MLH1.fasta --gnomadversion 2
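When prescott has to be run for several proteins, the command line can be assembled from Python. This is a minimal sketch that uses only the options shown above (-e, -g, -s and --gnomadversion). The function names are hypothetical, and it assumes that the prescott executable is on your PATH.

```python
import subprocess


def build_prescott_command(escott_output, gnomad_csv, query_fasta,
                           gnomad_version=None):
    """Return the prescott command as a list of arguments."""
    command = [
        "prescott",
        "-e", str(escott_output),
        "-g", str(gnomad_csv),
        "-s", str(query_fasta),
    ]
    # Only add the version flag when one is requested explicitly.
    if gnomad_version is not None:
        command += ["--gnomadversion", str(gnomad_version)]
    return command


def run_prescott(escott_output, gnomad_csv, query_fasta, gnomad_version=None):
    """Run prescott and raise CalledProcessError if it exits with an error."""
    command = build_prescott_command(
        escott_output, gnomad_csv, query_fasta, gnomad_version
    )
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.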

The most important output is the prescott-scores.csv file, which contains the entire single point mutational landscape of the protein.

In addition, there is a file called prescott-scores-details.csv. It contains all information about the variants modulated by the population information coming from the gnomad file, as well as the non-modulated variants.
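The column layout of these csv files is not described here, so a quick way to get oriented is to print the header and the first few rows. The sketch below uses only the Python standard library. The function name peek_csv is hypothetical, and it assumes a comma-delimited file with a header row.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return (column_names, first_rows) for a comma-delimited file.

    first_rows is a list of dicts keyed by column name.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows


# Example (file name taken from the text above):
# columns, rows = peek_csv("prescott-scores.csv")
# print(columns)
```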

Finally, if you have both pathogenic and benign labels in the gnomad file, there will be a ‘clinvar-vs-position.png’ file showing how these labeled variants are affected by population information.

Please note that the example input files of the MLH1 protein for prescott calculations are in the data directory of this repository.

Installation

PRESCOTT is implemented in Python 3 and R. It has been tested only on Linux. Since PRESCOTT has many dependencies, we recommend using our web site or our docker image. If you are a determined user, you can find the steps required to install it from source in the docs folder of this repository.

Citation

Mustafa Tekpinar, Thomas Henry, Alessandra Carbone. PRESCOTT: a population aware, epistatic and structural model accurately predicts missense effect.