Default model change: from senseppi.ckpt to fly_worm_human_chicken.ckpt

parent 358ad745
......@@ -127,6 +127,8 @@ dmypy.json
/esm2_embs_3B
*.sh
draft.py
/data/string_species/mmseqs_dbs/
/data/string_species/mmseqs_dbs_orig/
/data/human_virus/all_test_viruses.csv
/esm2_backup
/data/string_species/mmseqs_dbs/
/data/string_species/mmseqs_dbs_fwh/
......@@ -33,15 +33,17 @@ the interactions are taken from the STRING database (based on seed proteins).
Predictions are compared with the STRING database. Optionally, the graphs can be constructed.
- `create_dataset`: creates a dataset from the STRING database based on the taxonomic ID of the organism.
The package already comes with one pretrained version of the model, `fly_worm_human_chicken.ckpt` (checkpoint with weights), which is used by **default** if no model path is specified.
This model was trained on a dataset that combines PPIs from D. melanogaster, C. elegans, H. sapiens and G. gallus, and it provides the best performance among the pretrained models.
The original SENSE-PPI repository contains two models (checkpoints with weights) pretrained on human PPIs: `senseppi.ckpt` and `dscript.ckpt` pretrained on SENSE-PPI and DSCRIPT human datasets respectively.
The original SENSE-PPI repository also contains two models pretrained on human PPIs: `senseppi.ckpt` and `dscript.ckpt`, trained on the SENSE-PPI and DSCRIPT human datasets, respectively.
- `senseppi.ckpt`: Download from [here](http://gitlab.lcqb.upmc.fr/Konstvv/SENSE-PPI/raw/master/pretrained_models/senseppi.ckpt)
- `dscript.ckpt` : Download from [here](http://gitlab.lcqb.upmc.fr/Konstvv/SENSE-PPI/raw/master/pretrained_models/dscript.ckpt)
The package already comes with preinstalled model `senseppi.ckpt` that is used by default if model path is not specified.
For information about the other models that can be found in the pretrained_models folder, please refer to the original article.
**N.B.**: Both pretrained models were made to work with proteins in range 50-800 amino acids.
**N.B.**: All pretrained models were designed to work with proteins in the range of 50-800 amino acids.
In order to cite the original SENSE-PPI paper, please use the following link: https://doi.org/10.1101/2023.09.19.558413
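
The 50-800 amino acid limit noted above can be illustrated with a small length filter. This is only a minimal sketch, not the package's own code: `read_fasta` and `filter_by_length` are hypothetical helper names, and treating both bounds as inclusive is an assumption (the defaults mirror the documented `--min_len 50` and `--max_len 800`).

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {identifier: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the identifier.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[name] += line
    return records


def filter_by_length(records: Dict[str, str],
                     min_len: int = 50,
                     max_len: int = 800) -> Dict[str, str]:
    """Keep sequences whose length lies in [min_len, max_len].

    Inclusive bounds are an assumption made for this sketch.
    """
    return {name: seq for name, seq in records.items()
            if min_len <= len(seq) <= max_len}
```

For example, `filter_by_length(read_fasta(open("proteins.fasta")))` would return only the sequences the pretrained models are meant to handle; `proteins.fasta` is a placeholder file name.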
......
......@@ -34,10 +34,9 @@ Predict
.. code-block:: bash
usage: senseppi <command> [<args>] predict [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [--pairs_file PAIRS_FILE]
[-o OUTPUT] [--with_self] [-p PRED_THRESHOLD] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM]
[--toks_per_batch_esm TOKS_PER_BATCH_ESM]
fasta_file
usage: senseppi <command> [<args>] predict [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [--pairs_file PAIRS_FILE] [-o OUTPUT] [--with_self] [-p PRED_THRESHOLD]
[--num_nodes NUM_NODES] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
fasta_file
positional arguments:
fasta_file FASTA file on which to extract the ESM2 representations and then test.
......@@ -48,29 +47,29 @@ Predict
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps is temporarily disabled, if it is chosen, cpu will
be used instead. (Default: auto)
Predict args:
--model_path MODEL_PATH
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
(Default: None)
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on human PPIs) (Default: None)
--pairs_file PAIRS_FILE
A path to a .tsv file with pairs of proteins to test (Optional). If not provided, all-to-all pairs will be generated. (Default: None)
-o OUTPUT, --output OUTPUT
A path to a file where the predictions will be saved. (.tsv format will be added automatically) (Default: predictions)
--with_self Include self-interactions in the predictions.By default they are not included since they were not part of training but they can be included by setting this
flag to True.
--with_self Include self-interactions in the predictions.By default they are not included since they were not part of training but they can be included by setting this flag to True.
-p PRED_THRESHOLD, --pred_threshold PRED_THRESHOLD
Prediction threshold to determine interacting pairs that will be written to a separate file. Range: (0, 1). (Default: 0.5)
--num_nodes NUM_NODES
Number of nodes to use for launching on a cluster. (Default: 1)
Args_model:
--batch_size BATCH_SIZE
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
......@@ -83,8 +82,8 @@ Test
.. code-block:: bash
usage: senseppi <command> [<args>] test [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-o OUTPUT]
[--crop_data_to_model_lims] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
usage: senseppi <command> [<args>] test [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-o OUTPUT] [--crop_data_to_model_lims] [--num_nodes NUM_NODES]
[--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
pairs_file fasta_file
positional arguments:
......@@ -97,26 +96,26 @@ Test
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps is temporarily disabled, if it is chosen, cpu will
be used instead. (Default: auto)
Predict args:
--model_path MODEL_PATH
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
(Default: None)
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on human PPIs) (Default: None)
-o OUTPUT, --output OUTPUT
A path to a file where the test metrics will be saved. (.tsv format will be added automatically) (Default: test_metrics)
--crop_data_to_model_lims
If set, the data will be cropped to the limits of the model: evaluations will be done only for proteins >50aa and <800aa. WARNING: this will modify the
original input files.
If set, the data will be cropped to the limits of the model: evaluations will be done only for proteins >50aa and <800aa. WARNING: this will modify the original input files.
--num_nodes NUM_NODES
Number of nodes to use for launching on a cluster. (Default: 1)
Args_model:
--batch_size BATCH_SIZE
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
......@@ -135,14 +134,12 @@ A dataset for training must be provided as two separate files:
.. code-block:: bash
usage: senseppi <command> [<args>] train [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--valid_size VALID_SIZE] [--seed SEED]
[--num_epochs NUM_EPOCHS] [--num_devices NUM_DEVICES] [--num_nodes NUM_NODES] [--early_stop EARLY_STOP] [--lr LR]
[--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
usage: senseppi <command> [<args>] train [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--valid_size VALID_SIZE] [--seed SEED] [--num_epochs NUM_EPOCHS] [--num_nodes NUM_NODES]
[--early_stop EARLY_STOP] [--lr LR] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
pairs_file fasta_file
positional arguments:
pairs_file A path to a .tsv file containing training pairs. Required format: 3 tab separated columns: first protein, second protein (protein names have to be present
in fasta_file), label (0 or 1).
pairs_file A path to a .tsv file containing training pairs. Required format: 3 tab separated columns: first protein, second protein (protein names have to be present in fasta_file), label (0 or 1).
fasta_file FASTA file on which to extract the ESM2 representations and then train.
options:
......@@ -151,8 +148,8 @@ A dataset for training must be provided as two separate files:
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps is temporarily disabled, if it is chosen, cpu will
be used instead. (Default: auto)
Training args:
Arguments for training the model.
......@@ -162,12 +159,10 @@ A dataset for training must be provided as two separate files:
--seed SEED Global training seed. (Default: None)
--num_epochs NUM_EPOCHS
Number of training epochs. (Default: 100)
--num_devices NUM_DEVICES
Number of devices to use for multi GPU training. (Default: 1)
--num_nodes NUM_NODES
Number of nodes to use for training on a cluster. (Default: 1)
--early_stop EARLY_STOP
Number of epochs to wait before stopping the training (tracking is done with validation loss). By default, the is no early stopping. (Default: None)
Number of epochs to wait before stopping the training (tracking is done with validation loss). By default, the is no early stopping. (Default: 10)
Args_model:
--lr LR Learning rate for training. Cosine warmup will be applied. (Default: 0.0001)
......@@ -175,8 +170,8 @@ A dataset for training must be provided as two separate files:
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
......@@ -189,9 +184,8 @@ Predict_string
.. code-block:: bash
usage: senseppi <command> [<args>] predict_string [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-s SPECIES] [-n NODES]
[-r SCORE] [-p PRED_THRESHOLD] [--graphs] [-o OUTPUT] [--network_type {physical,functional}]
[--delete_proteins DELETE_PROTEINS [DELETE_PROTEINS ...]] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM]
usage: senseppi <command> [<args>] predict_string [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-s SPECIES] [-n NODES] [-r SCORE] [-p PRED_THRESHOLD] [--graphs]
[-o OUTPUT] [--network_type {physical,functional}] [--delete_proteins DELETE_PROTEINS [DELETE_PROTEINS ...]] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM]
[--toks_per_batch_esm TOKS_PER_BATCH_ESM]
genes [genes ...]
......@@ -204,13 +198,12 @@ Predict_string
--min_len MIN_LEN Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
--max_len MAX_LEN Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
--device {cpu,gpu,mps,auto}
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps is temporarily disabled, if it is chosen, cpu will
be used instead. (Default: auto)
General options:
--model_path MODEL_PATH
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
(Default: None)
A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on human PPIs) (Default: None)
-s SPECIES, --species SPECIES
Species from STRING database. Default: H. Sapiens (Default: 9606)
-n NODES, --nodes NODES
......@@ -232,8 +225,8 @@ Predict_string
Batch size for training/testing. (Default: 32)
ESM2 model args:
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
sure that --output_dir_esm is set to the correct folder.
--output_dir_esm OUTPUT_DIR_ESM
output directory for extracted representations (Default: esm2_embs_3B)
......@@ -246,9 +239,8 @@ Create_dataset
.. code-block:: bash
usage: senseppi <command> [<args>] create_dataset [-h] [--interactions INTERACTIONS] [--sequences SEQUENCES] [--not_remove_long_short_proteins] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--max_positive_pairs MAX_POSITIVE_PAIRS] [--combined_score COMBINED_SCORE]
[--experimental_score EXPERIMENTAL_SCORE]
usage: senseppi <command> [<args>] create_dataset [-h] [--interactions INTERACTIONS] [--sequences SEQUENCES] [--not_remove_long_short_proteins] [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
[--max_positive_pairs MAX_POSITIVE_PAIRS] [--combined_score COMBINED_SCORE] [--experimental_score EXPERIMENTAL_SCORE]
species
positional arguments:
......@@ -267,10 +259,9 @@ Create_dataset
--max_length MAX_LENGTH
The maximum length of a protein to be included in the dataset. (Default: 800)
--max_positive_pairs MAX_POSITIVE_PAIRS
The maximum number of positive pairs to be included in the dataset. If None, all pairs are included. If specified, the pairs are selected based on the
combined score in STRING. (Default: None)
The maximum number of positive pairs to be included in the dataset. If None, all pairs are included. If specified, the pairs are selected based on the combined score in STRING. (Default: None)
--combined_score COMBINED_SCORE
The combined score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. (Default: 500)
--experimental_score EXPERIMENTAL_SCORE
The experimental score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. Default is None, which means that the experimental score is
not used. (Default: None)
The experimental score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. Default is None, which means that the experimental score is not used. (Default: None)
__version__ = "0.6.1"
__version__ = "0.6.2"
__author__ = "Konstantin Volzhenin"
from . import model, commands, esm2_model, dataset, utils, network_utils
......
......@@ -71,7 +71,7 @@ def add_args(parser):
)
predict_args.add_argument("--model_path", type=str, default=None,
help="A path to .ckpt file that contains weights to a pretrained model. If "
"None, the preinstalled senseppi.ckpt trained version is used. "
"None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. "
"(Trained on human PPIs)")
predict_args.add_argument("--pairs_file", type=str, default=None,
help="A path to a .tsv file with pairs of proteins to test (Optional). If not provided, "
......
......@@ -173,7 +173,7 @@ def add_args(parser):
"typed (separated by whitespaces).")
string_pred_args.add_argument("--model_path", type=str, default=None,
help="A path to .ckpt file that contains weights to a pretrained model. If "
"None, the preinstalled senseppi.ckpt trained version is used. "
"None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. "
"(Trained on human PPIs)")
string_pred_args.add_argument("-s", "--species", type=int, default=9606,
help="Species from STRING database. Default: H. Sapiens")
......
......@@ -47,7 +47,7 @@ def add_args(parser):
)
test_args.add_argument("--model_path", type=str, default=None,
help="A path to .ckpt file that contains weights to a pretrained model. If "
"None, the preinstalled senseppi.ckpt trained version is used. "
"None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. "
"(Trained on human PPIs)")
test_args.add_argument("-o", "--output", type=str, default="test_metrics",
help="A path to a file where the test metrics will be saved. "
......
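
The `--model_path` help strings changed in this commit describe a fallback: when the argument is left at `None`, the bundled `fly_worm_human_chicken.ckpt` is used. The sketch below shows one way such a fallback could be written. It is an illustration under assumptions, not the package's actual implementation: `DEFAULT_CKPT`, `build_parser` and `resolve_model_path` are hypothetical names, and the checkpoint location inside the package is assumed.

```python
import argparse
from pathlib import Path
from typing import Optional

# Assumed location of the bundled checkpoint; the real package layout may differ.
DEFAULT_CKPT = Path("pretrained_models") / "fly_worm_human_chicken.ckpt"


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing only --model_path (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--model_path", type=str, default=None,
                        help="A path to a .ckpt file with pretrained weights. "
                             "If None, the bundled default checkpoint is used.")
    return parser


def resolve_model_path(model_path: Optional[str]) -> Path:
    """Return the user-supplied path, or the bundled default when None."""
    return Path(model_path) if model_path is not None else DEFAULT_CKPT


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_model_path(args.model_path))
```

Keeping the default as `None` in the parser and resolving it in one place means that changing the default model, as this commit does, touches a single constant rather than every subcommand.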