Default model change: from senseppi.ckpt to fly_worm_human_chicken.ckpt

Commit 7fc6206a authored Dec 18, 2023 by Konstantin Volzhenin
7 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -127,6 +127,8 @@ dmypy.json
 /esm2_embs_3B
 *.sh
 draft.py
-/data/string_species/mmseqs_dbs/
+/data/string_species/mmseqs_dbs_orig/
 /data/human_virus/all_test_viruses.csv
 /esm2_backup
+/data/string_species/mmseqs_dbs/
+/data/string_species/mmseqs_dbs_fwh/
--- a/README.md
+++ b/README.md
@@ -33,15 +33,17 @@ the interactions are taken from the STRING database (based on seed proteins).
 Predictions are compared with the STRING database. Optionally, the graphs can be constructed.
 - `create_dataset`: creates a dataset from the STRING database based on the taxonomic ID of the organism.

+The package already comes with one pretrained version of the model, `fly_worm_human_chicken.ckpt` (a checkpoint with weights), which is used by **default** if no model path is specified.
+This model was trained on a dataset that combines PPIs from D. melanogaster, C. elegans, H. sapiens and G. gallus, and it provides the best performance compared with the other pretrained models.

-The original SENSE-PPI repository contains two models (checkpoints with weights) pretrained on human PPIs: `senseppi.ckpt` and `dscript.ckpt` pretrained on SENSE-PPI and DSCRIPT human datasets respectively.
+The original SENSE-PPI repository also contains two models pretrained on human PPIs: `senseppi.ckpt` and `dscript.ckpt`, trained on the SENSE-PPI and DSCRIPT human datasets, respectively.

 - `senseppi.ckpt`: Download from [here](http://gitlab.lcqb.upmc.fr/Konstvv/SENSE-PPI/raw/master/pretrained_models/senseppi.ckpt)
 - `dscript.ckpt` : Download from [here](http://gitlab.lcqb.upmc.fr/Konstvv/SENSE-PPI/raw/master/pretrained_models/dscript.ckpt)

-The package already comes with preinstalled model `senseppi.ckpt` that is used by default if model path is not specified.
+For information about the other models that can be found in the pretrained_models folder, please refer to the original article.

-**N.B.**: Both pretrained models were made to work with proteins in range 50-800 amino acids.
+**N.B.**: All pretrained models were made to work with proteins in the range of 50-800 amino acids.

 In order to cite the original SENSE-PPI paper, please use the following link: https://doi.org/10.1101/2023.09.19.558413  
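
The 50-800 amino acid limits noted above could be applied to a set of sequences roughly as follows. This is a minimal sketch rather than the package's own code: the function name `filter_by_length` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
from typing import Dict


def filter_by_length(
    seqs: Dict[str, str], min_len: int = 50, max_len: int = 800
) -> Dict[str, str]:
    """Keep only sequences whose length lies within [min_len, max_len].

    Assumption: both bounds are inclusive; the package may treat them differently.
    """
    return {name: seq for name, seq in seqs.items() if min_len <= len(seq) <= max_len}


# Toy sequences of length 49, 50, 800 and 801.
example = {"a": "M" * 49, "b": "M" * 50, "c": "M" * 800, "d": "M" * 801}
kept = filter_by_length(example)
```

With these toy inputs, only `b` and `c` remain in `kept`.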


--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -34,9 +34,8 @@ Predict

 .. code-block:: bash

-    usage: senseppi <command> [<args>] predict [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [--pairs_file PAIRS_FILE]
-                                           [-o OUTPUT] [--with_self] [-p PRED_THRESHOLD] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM]
-                                           [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
+    usage: senseppi <command> [<args>] predict [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [--pairs_file PAIRS_FILE] [-o OUTPUT] [--with_self] [-p PRED_THRESHOLD]
+                                               [--num_nodes NUM_NODES] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
                                               fasta_file

    positional arguments:
@@ -48,29 +47,29 @@ Predict
      --min_len MIN_LEN     Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
      --max_len MAX_LEN     Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
      --device {cpu,gpu,mps,auto}
-                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
-                            is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
+                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
+                            be used instead. (Default: auto)

    Predict args:
      --model_path MODEL_PATH
-                            A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
-                            (Default: None)
+                            A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on fly, worm, human and chicken PPIs) (Default: None)
      --pairs_file PAIRS_FILE
                            A path to a .tsv file with pairs of proteins to test (Optional). If not provided, all-to-all pairs will be generated. (Default: None)
      -o OUTPUT, --output OUTPUT
                            A path to a file where the predictions will be saved. (.tsv format will be added automatically) (Default: predictions)
-      --with_self           Include self-interactions in the predictions.By default they are not included since they were not part of training but they can be included by setting this
-                            flag to True.
+      --with_self           Include self-interactions in the predictions. By default, they are not included since they were not part of training, but they can be included by setting this flag to True.
      -p PRED_THRESHOLD, --pred_threshold PRED_THRESHOLD
                            Prediction threshold to determine interacting pairs that will be written to a separate file. Range: (0, 1). (Default: 0.5)
+      --num_nodes NUM_NODES
+                            Number of nodes to use for launching on a cluster. (Default: 1)

    Args_model:
      --batch_size BATCH_SIZE
                            Batch size for training/testing. (Default: 32)

    ESM2 model args:
-      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
-      multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
+      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
+      sure that --output_dir_esm is set to the correct folder.

      --output_dir_esm OUTPUT_DIR_ESM
                            output directory for extracted representations (Default: esm2_embs_3B)
@@ -83,8 +82,8 @@ Test

 .. code-block:: bash

-    usage: senseppi <command> [<args>] test [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-o OUTPUT]
-                                            [--crop_data_to_model_lims] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
+    usage: senseppi <command> [<args>] test [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-o OUTPUT] [--crop_data_to_model_lims] [--num_nodes NUM_NODES]
+                                            [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
                                            pairs_file fasta_file

    positional arguments:
@@ -97,26 +96,26 @@ Test
      --min_len MIN_LEN     Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
      --max_len MAX_LEN     Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
      --device {cpu,gpu,mps,auto}
-                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
-                            is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
+                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
+                            be used instead. (Default: auto)

    Predict args:
      --model_path MODEL_PATH
-                            A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
-                            (Default: None)
+                            A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on fly, worm, human and chicken PPIs) (Default: None)
      -o OUTPUT, --output OUTPUT
                            A path to a file where the test metrics will be saved. (.tsv format will be added automatically) (Default: test_metrics)
      --crop_data_to_model_lims
-                            If set, the data will be cropped to the limits of the model: evaluations will be done only for proteins >50aa and <800aa. WARNING: this will modify the
-                            original input files.
+                            If set, the data will be cropped to the limits of the model: evaluations will be done only for proteins >50aa and <800aa. WARNING: this will modify the original input files.
+      --num_nodes NUM_NODES
+                            Number of nodes to use for launching on a cluster. (Default: 1)

    Args_model:
      --batch_size BATCH_SIZE
                            Batch size for training/testing. (Default: 32)

    ESM2 model args:
-      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
-      multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
+      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
+      sure that --output_dir_esm is set to the correct folder.

      --output_dir_esm OUTPUT_DIR_ESM
                            output directory for extracted representations (Default: esm2_embs_3B)
@@ -135,14 +134,12 @@ A dataset for training must be provided as two separate files:

 .. code-block:: bash

-    usage: senseppi <command> [<args>] train [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--valid_size VALID_SIZE] [--seed SEED]
-                                             [--num_epochs NUM_EPOCHS] [--num_devices NUM_DEVICES] [--num_nodes NUM_NODES] [--early_stop EARLY_STOP] [--lr LR]
-                                             [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
+    usage: senseppi <command> [<args>] train [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--valid_size VALID_SIZE] [--seed SEED] [--num_epochs NUM_EPOCHS] [--num_nodes NUM_NODES]
+                                             [--early_stop EARLY_STOP] [--lr LR] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM] [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
                                             pairs_file fasta_file

    positional arguments:
-      pairs_file            A path to a .tsv file containing training pairs. Required format: 3 tab separated columns: first protein, second protein (protein names have to be present
-                            in fasta_file), label (0 or 1).
+      pairs_file            A path to a .tsv file containing training pairs. Required format: 3 tab separated columns: first protein, second protein (protein names have to be present in fasta_file), label (0 or 1).
      fasta_file            FASTA file on which to extract the ESM2 representations and then train.

    options:
@@ -151,8 +148,8 @@ A dataset for training must be provided as two separate files:
      --min_len MIN_LEN     Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
      --max_len MAX_LEN     Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
      --device {cpu,gpu,mps,auto}
-                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
-                            is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
+                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
+                            be used instead. (Default: auto)

    Training args:
      Arguments for training the model.
@@ -162,12 +159,10 @@ A dataset for training must be provided as two separate files:
      --seed SEED           Global training seed. (Default: None)
      --num_epochs NUM_EPOCHS
                            Number of training epochs. (Default: 100)
-      --num_devices NUM_DEVICES
-                            Number of devices to use for multi GPU training. (Default: 1)
      --num_nodes NUM_NODES
                            Number of nodes to use for training on a cluster. (Default: 1)
      --early_stop EARLY_STOP
-                            Number of epochs to wait before stopping the training (tracking is done with validation loss). By default, the is no early stopping. (Default: None)
+                            Number of epochs to wait before stopping the training (tracking is done with validation loss). (Default: 10)

    Args_model:
      --lr LR               Learning rate for training. Cosine warmup will be applied. (Default: 0.0001)
@@ -175,8 +170,8 @@ A dataset for training must be provided as two separate files:
                            Batch size for training/testing. (Default: 32)

    ESM2 model args:
-      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
-      multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
+      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
+      sure that --output_dir_esm is set to the correct folder.

      --output_dir_esm OUTPUT_DIR_ESM
                            output directory for extracted representations (Default: esm2_embs_3B)
@@ -189,9 +184,8 @@ Predict_string

 .. code-block:: bash

-    usage: senseppi <command> [<args>] predict_string [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-s SPECIES] [-n NODES]
-                                                      [-r SCORE] [-p PRED_THRESHOLD] [--graphs] [-o OUTPUT] [--network_type {physical,functional}]
-                                                      [--delete_proteins DELETE_PROTEINS [DELETE_PROTEINS ...]] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM]
+    usage: senseppi <command> [<args>] predict_string [-h] [-v] [--min_len MIN_LEN] [--max_len MAX_LEN] [--device {cpu,gpu,mps,auto}] [--model_path MODEL_PATH] [-s SPECIES] [-n NODES] [-r SCORE] [-p PRED_THRESHOLD] [--graphs]
+                                                      [-o OUTPUT] [--network_type {physical,functional}] [--delete_proteins DELETE_PROTEINS [DELETE_PROTEINS ...]] [--batch_size BATCH_SIZE] [--output_dir_esm OUTPUT_DIR_ESM]
                                                      [--toks_per_batch_esm TOKS_PER_BATCH_ESM]
                                                      genes [genes ...]

@@ -204,13 +198,12 @@ Predict_string
      --min_len MIN_LEN     Minimum length of the protein sequence. The sequences with smaller length will not be considered and will be deleted from the fasta file. (Default: 50)
      --max_len MAX_LEN     Maximum length of the protein sequence. The sequences with larger length will not be considered and will be deleted from the fasta file. (Default: 800)
      --device {cpu,gpu,mps,auto}
-                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto.If not selected the device is set by torch automatically. WARNING: mps
-                            is temporarily disabled, if it is chosen, cpu will be used instead. (Default: auto)
+                            Device to use for computations. Options include: cpu, gpu, mps (for MacOS), and auto. If not selected, the device is set by torch automatically. WARNING: mps is temporarily disabled; if it is chosen, cpu will
+                            be used instead. (Default: auto)

    General options:
      --model_path MODEL_PATH
-                            A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled senseppi.ckpt trained version is used. (Trained on human PPIs)
-                            (Default: None)
+                            A path to .ckpt file that contains weights to a pretrained model. If None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. (Trained on fly, worm, human and chicken PPIs) (Default: None)
      -s SPECIES, --species SPECIES
                            Species from STRING database. Default: H. Sapiens (Default: 9606)
      -n NODES, --nodes NODES
@@ -232,8 +225,8 @@ Predict_string
                            Batch size for training/testing. (Default: 32)

    ESM2 model args:
-      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in
-      multiple runs. In order to reuse the embeddings, make sure that --output_dir_esm is set to the correct folder.
+      ESM2: Extract per-token representations and model outputs for sequences in a FASTA file. The representations are saved in --output_dir_esm folder so they can be reused in multiple runs. In order to reuse the embeddings, make
+      sure that --output_dir_esm is set to the correct folder.

      --output_dir_esm OUTPUT_DIR_ESM
                            output directory for extracted representations (Default: esm2_embs_3B)
@@ -246,9 +239,8 @@ Create_dataset

 .. code-block:: bash

-    usage: senseppi <command> [<args>] create_dataset [-h] [--interactions INTERACTIONS] [--sequences SEQUENCES] [--not_remove_long_short_proteins] [--min_length MIN_LENGTH]
-                                                      [--max_length MAX_LENGTH] [--max_positive_pairs MAX_POSITIVE_PAIRS] [--combined_score COMBINED_SCORE]
-                                                      [--experimental_score EXPERIMENTAL_SCORE]
+    usage: senseppi <command> [<args>] create_dataset [-h] [--interactions INTERACTIONS] [--sequences SEQUENCES] [--not_remove_long_short_proteins] [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
+                                                      [--max_positive_pairs MAX_POSITIVE_PAIRS] [--combined_score COMBINED_SCORE] [--experimental_score EXPERIMENTAL_SCORE]
                                                      species

    positional arguments:
@@ -267,10 +259,9 @@ Create_dataset
      --max_length MAX_LENGTH
                            The maximum length of a protein to be included in the dataset. (Default: 800)
      --max_positive_pairs MAX_POSITIVE_PAIRS
-                            The maximum number of positive pairs to be included in the dataset. If None, all pairs are included. If specified, the pairs are selected based on the
-                            combined score in STRING. (Default: None)
+                            The maximum number of positive pairs to be included in the dataset. If None, all pairs are included. If specified, the pairs are selected based on the combined score in STRING. (Default: None)
      --combined_score COMBINED_SCORE
                            The combined score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. (Default: 500)
      --experimental_score EXPERIMENTAL_SCORE
-                            The experimental score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. Default is None, which means that the experimental score is
-                            not used. (Default: None)
+                            The experimental score threshold for the pairs extracted from STRING. Ranges from 0 to 1000. Default is None, which means that the experimental score is not used. (Default: None)
+
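
The training `pairs_file` format described in the usage text above (three tab-separated columns: first protein, second protein, label 0 or 1, with both names present in the FASTA file) can be checked before launching a run. The sketch below is only an illustration of such a check: `read_pairs` and its arguments are hypothetical names, not part of the senseppi API.

```python
import csv
from typing import Iterable, List, Set, Tuple


def read_pairs(lines: Iterable[str], known_names: Set[str]) -> List[Tuple[str, str, int]]:
    """Parse tab-separated (protein_a, protein_b, label) rows and validate them.

    `known_names` is assumed to hold the sequence identifiers from the FASTA file.
    """
    pairs = []
    for row_num, row in enumerate(csv.reader(lines, delimiter="\t"), start=1):
        if len(row) != 3:
            raise ValueError(f"row {row_num}: expected 3 columns, got {len(row)}")
        first, second, label = row
        if first not in known_names or second not in known_names:
            raise ValueError(f"row {row_num}: protein name missing from FASTA file")
        if label not in ("0", "1"):
            raise ValueError(f"row {row_num}: label must be 0 or 1, got {label!r}")
        pairs.append((first, second, int(label)))
    return pairs


# Toy input: two valid rows referring to three known proteins.
rows = ["P1\tP2\t1", "P2\tP3\t0"]
parsed = read_pairs(rows, {"P1", "P2", "P3"})
```

Because `csv.reader` accepts any iterable of strings, an open file handle can be passed in place of the list used here.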
--- a/senseppi/__init__.py
+++ b/senseppi/__init__.py
-__version__ = "0.6.1"
+__version__ = "0.6.2"
 __author__ = "Konstantin Volzhenin"

 from . import model, commands, esm2_model, dataset, utils, network_utils

--- a/senseppi/commands/predict.py
+++ b/senseppi/commands/predict.py
@@ -71,7 +71,7 @@ def add_args(parser):
                                          )
    predict_args.add_argument("--model_path", type=str, default=None,
                              help="A path to .ckpt file that contains weights to a pretrained model. If "
-                                   "None, the preinstalled senseppi.ckpt trained version is used. "
+                                   "None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. "
                                   "(Trained on human PPIs)")
    predict_args.add_argument("--pairs_file", type=str, default=None,
                              help="A path to a .tsv file with pairs of proteins to test (Optional). If not provided, "

--- a/senseppi/commands/predict_string.py
+++ b/senseppi/commands/predict_string.py
@@ -173,7 +173,7 @@ def add_args(parser):
                                               "typed (separated by whitespaces).")
    string_pred_args.add_argument("--model_path", type=str, default=None,
                                  help="A path to .ckpt file that contains weights to a pretrained model. If "
-                                       "None, the preinstalled senseppi.ckpt trained version is used. "
+                                       "None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. "
                                       "(Trained on human PPIs)")
    string_pred_args.add_argument("-s", "--species", type=int, default=9606,
                                  help="Species from STRING database. Default: H. Sapiens")

--- a/senseppi/commands/test.py
+++ b/senseppi/commands/test.py
@@ -47,7 +47,7 @@ def add_args(parser):
                                          )
    test_args.add_argument("--model_path", type=str, default=None,
                           help="A path to .ckpt file that contains weights to a pretrained model. If "
-                                "None, the preinstalled senseppi.ckpt trained version is used. "
+                                "None, the preinstalled fly_worm_human_chicken.ckpt trained version is used. "
                                "(Trained on human PPIs)")
    test_args.add_argument("-o", "--output", type=str, default="test_metrics",
                           help="A path to a file where the test metrics will be saved. "
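
The fallback described in the `--model_path` help strings (use the bundled `fly_worm_human_chicken.ckpt` when no path is given) might look roughly like the sketch below. It is an assumption-laden illustration, not the code from this commit: `resolve_model_path`, the `package_dir` argument and the `default_model` folder name are all hypothetical.

```python
from pathlib import Path
from typing import Optional

# File name taken from the commit; its location inside the package is assumed.
DEFAULT_CKPT_NAME = "fly_worm_human_chicken.ckpt"


def resolve_model_path(model_path: Optional[str], package_dir: Path) -> Path:
    """Return the user-supplied checkpoint path, or the bundled default if None."""
    if model_path is None:
        # Hypothetical folder layout; the real package may store the file elsewhere.
        return package_dir / "default_model" / DEFAULT_CKPT_NAME
    return Path(model_path)


default_ckpt = resolve_model_path(None, Path("/opt/senseppi"))
custom_ckpt = resolve_model_path("my_model.ckpt", Path("/opt/senseppi"))
```

Keeping the default file name in a single constant would let the help strings in `predict.py`, `predict_string.py` and `test.py` reference it instead of repeating the literal.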