@@ -33,15 +33,17 @@ the interactions are taken from the STRING database (based on seed proteins).
Predictions are compared with the STRING database. Optionally, the graphs can be constructed.
-`create_dataset`: creates a dataset from the STRING database based on the taxonomic ID of the organism.
The package already comes with one pretrained version of the model, `fly_worm_human_chiken.ckpt` (checkpoint with weights), which is used by **default** if no model path is specified.
This model was trained on a dataset that combines PPIs from D. melanogaster, C. elegans, H. sapiens and G. gallus, and it performs best among the pretrained models.
The original SENSE-PPI repository contains two models (checkpoints with weights) pretrained on human PPIs: `senseppi.ckpt` and `dscript.ckpt`, trained on the SENSE-PPI and DSCRIPT human datasets respectively.
The original SENSE-PPI repository also contains two models pretrained on human PPIs: `senseppi.ckpt` and `dscript.ckpt`, trained on the SENSE-PPI and DSCRIPT human datasets respectively.
-`senseppi.ckpt`: Download from [here](http://gitlab.lcqb.upmc.fr/Konstvv/SENSE-PPI/raw/master/pretrained_models/senseppi.ckpt)
-`dscript.ckpt` : Download from [here](http://gitlab.lcqb.upmc.fr/Konstvv/SENSE-PPI/raw/master/pretrained_models/dscript.ckpt)
The package already comes with the preinstalled model `senseppi.ckpt`, which is used by default if no model path is specified.
For information about the other models that can be found in the pretrained_models folder, please refer to the original article.
**N.B.**: Both pretrained models were made to work with proteins in the range of 50-800 amino acids.
**N.B.**: All pretrained models were made to work with proteins in the range of 50-800 amino acids.
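
As a rough illustration of the 50-800 amino acid limit, the sketch below filters FASTA records by sequence length before they would be passed to a model. It is a standalone example and not the package's own code: the helpers `read_fasta` and `filter_by_length` are hypothetical names, and treating both bounds as inclusive is an assumption.

```python
from typing import Dict, Iterable

MIN_LEN = 50   # lower length limit stated for the pretrained models
MAX_LEN = 800  # upper length limit stated for the pretrained models


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current] += line
    return records


def filter_by_length(records: Dict[str, str],
                     min_len: int = MIN_LEN,
                     max_len: int = MAX_LEN) -> Dict[str, str]:
    """Keep sequences with min_len <= length <= max_len (inclusive bounds assumed)."""
    return {name: seq for name, seq in records.items()
            if min_len <= len(seq) <= max_len}


# Toy input: one sequence that is too short, one in range, one too long.
example = [">short", "M" * 10, ">ok", "M" * 60, "A" * 40, ">long", "M" * 900]
kept = filter_by_length(read_fasta(example))
print(sorted(kept))
```

Only the record named `ok` (100 residues across two wrapped lines) survives the filter in this toy input.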
In order to cite the original SENSE-PPI paper, please use the following link: https://doi.org/10.1101/2023.09.19.558413
fasta_file FASTA file on which to extract the ESM2 representations and then test.
...
...
@@ -48,29 +47,29 @@ Predict
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps
is temporarily disabled; if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
be used instead. (Default: auto)
Predict args:
--model_path MODEL_PATH
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
(Default: None)
A path to a .ckpt file that contains the weights of a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on D. melanogaster, C. elegans, H. sapiens and G. gallus PPIs) (Default: None)
--pairs_file PAIRS_FILE
A path to a .tsv file with pairs of proteins to test (Optional). If not provided, all-to-all pairs will be generated. (Default: None)
-o OUTPUT, --output OUTPUT
A path to a file where the predictions will be saved. (.tsv format will be added automatically) (Default: predictions)
--with_self Include self-interactions in the predictions. By default they are not included since they were not part of training, but they can be included by setting this
flag to True.
--with_self Include self-interactions in the predictions. By default they are not included since they were not part of training, but they can be included by setting this flag to True.
Prediction threshold to determine interacting pairs that will be written to a separate file. Range: (0, 1). (Default: 0.5)
--num_nodes NUM_NODES
Number of nodes to use for launching on a cluster. (Default: 1)
Args_model:
--batch_size BATCH_SIZE
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps
is temporarily disabled; if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
be used instead. (Default: auto)
Predict args:
--model_path MODEL_PATH
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
(Default: None)
A path to a .ckpt file that contains the weights of a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on D. melanogaster, C. elegans, H. sapiens and G. gallus PPIs) (Default: None)
-o OUTPUT, --output OUTPUT
A path to a file where the test metrics will be saved. (.tsv format will be added automatically) (Default: test_metrics)
--crop_data_to_model_lims
If set, the data will be cropped to the limits of the model: evaluations will be done only for proteins >50aa and <800aa. WARNING: this will modify the
original input files.
If set, the data will be cropped to the limits of the model: evaluations will be done only for proteins >50aa and <800aa. WARNING: this will modify the original input files.
--num_nodes NUM_NODES
Number of nodes to use for launching on a cluster. (Default: 1)
Args_model:
--batch_size BATCH_SIZE
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
...
...
@@ -135,14 +134,12 @@ A dataset for training must be provided as two separate files:
pairs_file A path to a .tsv file containing training pairs. Required format: 3 tab-separated columns: first protein, second protein (protein names have to be present
in fasta_file), label (0 or 1).
pairs_file A path to a .tsv file containing training pairs. Required format: 3 tab-separated columns: first protein, second protein (protein names have to be present in fasta_file), label (0 or 1).
fasta_file FASTA file on which to extract the ESM2 representations and then train.
options:
...
...
@@ -151,8 +148,8 @@ A dataset for training must be provided as two separate files:
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps
is temporarily disabled; if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
be used instead. (Default: auto)
Training args:
Arguments for training the model.
...
...
@@ -162,12 +159,10 @@ A dataset for training must be provided as two separate files:
--seed SEED Global training seed. (Default: None)
--num_epochs NUM_EPOCHS
Number of training epochs. (Default: 100)
--num_devices NUM_DEVICES
Number of devices to use for multi GPU training. (Default: 1)
--num_nodes NUM_NODES
Number of nodes to use for training on a cluster. (Default: 1)
--early_stop EARLY_STOP
Number of epochs to wait before stopping the training (tracking is done with validation loss). By default, there is no early stopping. (Default: None)
Number of epochs to wait before stopping the training (tracking is done with validation loss). If set to None, there is no early stopping. (Default: 10)
Args_model:
--lr LR Learning rate for training. Cosine warmup will be applied. (Default: 0.0001)
...
...
@@ -175,8 +170,8 @@ A dataset for training must be provided as two separate files:
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps
is temporarily disabled; if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for macOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
be used instead. (Default: auto)
General options:
--model_path MODEL_PATH
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
(Default: None)
A path to a .ckpt file that contains the weights of a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on D. melanogaster, C. elegans, H. sapiens and G. gallus PPIs) (Default: None)
-s SPECIES, --species SPECIES
Species from the STRING database, given as a taxonomic ID. Default: H. sapiens (Default: 9606)
-n NODES, --nodes NODES
...
...
@@ -232,8 +225,8 @@ Predict_string
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
The maximum length of a protein to be included in the dataset. (Default: 800)
--max_positive_pairs MAX_POSITIVE_PAIRS
The maximum number of positive pairs to be included in the dataset. If None, all pairs are included. If specified, the pairs are selected based on the
combined score in STRING. (Default: None)
The maximum number of positive pairs to be included in the dataset. If None, all pairs are included. If specified, the pairs are selected based on the combined score in STRING. (Default: None)
--combined_score COMBINED_SCORE
The combined score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. (Default: 500)
--experimental_score EXPERIMENTAL_SCORE
The experimental score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. Default is None, which means that the experimental score is
not used. (Default: None)
The experimental score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. Default is None, which means that the experimental score is not used. (Default: None)
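
The interplay between `--combined_score` and `--max_positive_pairs` can be pictured with the following sketch. It is not the package's implementation: `select_positive_pairs`, the tuple layout and the protein names are hypothetical, and it assumes that "selected based on the combined score" means keeping the highest-scoring pairs and that the threshold is inclusive.

```python
from typing import List, Optional, Tuple

# (protein_a, protein_b, combined_score), with the score in the 0..1000 range
Pair = Tuple[str, str, int]


def select_positive_pairs(pairs: List[Pair],
                          combined_score: int = 500,
                          max_positive_pairs: Optional[int] = None) -> List[Pair]:
    """Filter pairs by score threshold, then optionally cap their number."""
    # Keep pairs at or above the threshold (inclusive comparison is an assumption).
    kept = [p for p in pairs if p[2] >= combined_score]
    # Rank by combined score, highest first (assumed selection rule).
    kept.sort(key=lambda p: p[2], reverse=True)
    # None means "no cap", mirroring the documented default.
    if max_positive_pairs is not None:
        kept = kept[:max_positive_pairs]
    return kept


# Toy data with made-up protein identifiers.
pairs = [("P1", "P2", 900), ("P1", "P3", 450), ("P2", "P3", 700), ("P3", "P4", 500)]
top = select_positive_pairs(pairs, combined_score=500, max_positive_pairs=2)
print(top)
```

With the toy data, the pair scoring 450 is dropped by the threshold, and the cap of 2 then keeps the pairs scoring 900 and 700.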