Preparing Your Own Input

Preparing Your Input MSA and PDB with ColabFold

You have a fasta file for your protein of interest and you want to understand the impact of (certain) mutations. Before starting, please make sure that the sequence in your fasta file does not contain any gaps. The quickest way to obtain both a multiple sequence alignment (MSA) and a protein structure is to use ColabFold. Let’s do this step by step:

  1. Let’s go to the ColabFold website:

    https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

    Sign in with your Google account.

  2. Click on the ‘Connect’ button on the top right hand side.

  3. Clear the ‘query_sequence’ box and paste your sequence into it. As my example, I selected the fasta sequence of adenylate kinase (AKE) (https://www.rcsb.org/fasta/entry/4AKE/display).

  4. Change the ‘jobname’ to something that makes more sense to you.

  5. Go to the menu bar of your ‘AlphaFold2.ipynb’ notebook, where ‘File, Edit, View, Insert, Runtime, Tools, Help’ are listed. Click on ‘Runtime’ and select ‘Run all’.

  6. This process may take from a few minutes to a few hours, depending on the size of your protein. It will give you an a3m file and up to 5 PDB models. Put these files in a clean folder and change the directory to that folder in your terminal.

  7. Unfortunately, the a3m file is not in fasta format and it contains gap columns. We have to remove those gaps. We can do that with a GUI program like Ugene or Jalview, but that is a labor-intensive procedure. Here, I will use a small tool that I developed and added to the PRESCOTT docker image that I created.

  8. Start the docker image with the following command:

    sudo docker run -ti --rm --mount type=bind,source=$PWD,target=/home/tekpinar/research/myexample \
    tekpinar/prescott-docker:v1.5.0
    
  9. Now, change the directory to the myexample folder.

    cd ../myexample/
    ls -l
    

    You should see your a3m and pdb files in this folder.

  10. Let’s use a small script from hhsuite to convert the a3m file to fasta format.

    reformat.pl a3m fas AKE.a3m AKE.fasta
    
  11. Then remove the gaps:

    demust removegaps -i AKE.fasta -o AKE_nogaps.fasta
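
Before moving on, it may be worth checking that the result still looks like an alignment. The sketch below is not part of PRESCOTT or DEMUST. `msa_row_lengths` is a hypothetical helper name, and the only assumption is that every row of an aligned fasta file should have the same length. It prints each distinct row length together with the number of sequences that have it:

```shell
# Hypothetical helper: list the distinct sequence lengths in a FASTA file,
# one "<length> <number of sequences>" pair per line.
msa_row_lengths() {
    awk '
        # A header line closes the previous record (if any) and starts a new one.
        /^>/ { if (seen) count[length(seq)]++; seq = ""; seen = 1; next }
        # Sequence lines may be wrapped, so join them (dropping any CR).
             { sub(/\r$/, ""); seq = seq $0 }
        END  { if (seen) count[length(seq)]++
               for (len in count) print len, count[len] }
    ' "$1" | sort -n
}
```

Running `msa_row_lengths AKE_nogaps.fasta` should print a single line. More than one line would suggest that the rows no longer line up and that the previous steps should be rechecked.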
    

There is one last step to reach our goal. The ID and description parts of the header lines in the a3m and fasta files are too long, so we have to shorten them. We can do that with:

# Keep only the first field of each header line. $1 already starts with ">",
# so no extra ">" is printed.
awk 'BEGIN{FS=" "}{if(NF>1) {printf("%s\n", $1)}else{print $0}}' AKE_nogaps.fasta > AKE_nogaps_short_names.fasta
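
To see what the header shortening does, here is a small self-contained illustration on a made-up two-sequence file (the sequence names and residues are invented for the example). It is only a sketch: it keeps the first whitespace-delimited token of each header line and leaves sequence lines untouched.

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)

# A made-up alignment with long headers (one space-separated, one tab-separated).
printf '>seq1 some long description\nMRIIL\n>seq2\tscore=0.9\nMR-IL\n' > "$workdir/sample.fasta"

# Keep the first field of header lines; print sequence lines unchanged.
awk '/^>/ { print $1; next } { print }' "$workdir/sample.fasta" > "$workdir/sample_short.fasta"

cat "$workdir/sample_short.fasta"
# Prints:
# >seq1
# MRIIL
# >seq2
# MR-IL
```

Each header should start with exactly one `>`. A header that starts with `>>` means the command added a second one.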

Congratulations! Now, you have all the input files required for PRESCOTT:

  1. An input MSA: AKE_nogaps_short_names.fasta

  2. An input PDB: myprotein.pdb (one of the PDB models produced by ColabFold)
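
As a last optional check, the two files can be compared with each other. The sketch below assumes that the first sequence in the MSA is the query (as in ColabFold's a3m output) and that the PDB file is a single-chain model predicted from that same sequence, so the number of query residues should equal the number of C-alpha atoms. `query_length` and `ca_count` are hypothetical helper names, not PRESCOTT commands:

```shell
# Number of residues in the first (query) sequence of a FASTA file,
# ignoring gap characters ("-" and ".").
query_length() {
    awk '
        # Stop as soon as the second header is reached.
        /^>/ { if (++n > 1) exit; next }
             { sub(/\r$/, ""); gsub(/[-.]/, ""); len += length($0) }
        END  { print len + 0 }
    ' "$1"
}

# Number of C-alpha atoms in the ATOM records of a PDB file.
# Columns 13-16 of the fixed-width PDB format hold the atom name.
ca_count() {
    awk 'substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " { n++ }
         END { print n + 0 }' "$1"
}

# Example use with the file names from this tutorial:
#   [ "$(query_length AKE_nogaps_short_names.fasta)" -eq "$(ca_count myprotein.pdb)" ] && echo "lengths match"
```

If the two numbers differ, the MSA and the PDB model probably do not describe the same sequence.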