Commit 2912e611 by Riccardo Vicedomini

updated readme

parent 12b34f10
...@@ -10,16 +10,16 @@ We introduce MetaCLADE2, and improved profile-based domain annotation pipeline b ...@@ -10,16 +10,16 @@ We introduce MetaCLADE2, and improved profile-based domain annotation pipeline b
# System requirements # System requirements
+ MetaCLADE2 has been developed under a Linux environment. + MetaCLADE2 has been developed and tested under a Linux environment.
+ The bash environment should be installed. + The bash environment should be installed.
+ Python 3 is required for this package. + Python 3 is required for this package.
# Software requirements # Software requirements
+ HMMer-3 + [HMMer-3](http://hmmer.org)
+ DAMA + [DAMA](http://www.lcqb.upmc.fr/DAMA)
+ GNU parallel (optional but recommended for running jobs on multiple threads) + [GNU parallel](https://www.gnu.org/software/parallel) (optional but strongly recommended)
# Installation # Installation
...@@ -41,10 +41,12 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory. ...@@ -41,10 +41,12 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory.
USAGE: metaclade2 -i <input_fasta> -N <name> [options] USAGE: metaclade2 -i <input_fasta> -N <name> [options]
MANDATORY OPTIONS: MANDATORY OPTIONS:
-N, --name <str> Dataset/job name and name of the directory used to
store intermediate results
-i, --input <path> Input file of AA sequences in FASTA format -i, --input <path> Input file of AA sequences in FASTA format
-N, --name <str> Dataset/job name
MetaCLADE OPTIONS: MetaCLADE2 OPTIONS:
-o, --output <path> Output file of domain architecture
-a, --arch Use DAMA to properly compute domain architectures -a, --arch Use DAMA to properly compute domain architectures
(useful only for long protein sequences) (useful only for long protein sequences)
-d, --domain-list <str> Comma-separated list of Pfam accession numbers of -d, --domain-list <str> Comma-separated list of Pfam accession numbers of
...@@ -52,7 +54,8 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory. ...@@ -52,7 +54,8 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory.
(e.g., "PF00875,PF03441") (e.g., "PF00875,PF03441")
-D, --domain-file <path> File that contains the Pfam accession numbers -D, --domain-file <path> File that contains the Pfam accession numbers
of the domains to be considered (one per line) of the domains to be considered (one per line)
-W, --work-dir <path> Working directory, where jobs and results are saved -W, --work-dir <path> Working directory (default:current directory)
--remove-temp Remove temporary intermediate files, keeping only results and logs
DAMA OPTIONS: DAMA OPTIONS:
-e, --evalue-cutoff <float> E-value cutoff (default:1e-3) -e, --evalue-cutoff <float> E-value cutoff (default:1e-3)
...@@ -70,7 +73,7 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory. ...@@ -70,7 +73,7 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory.
-V, --version Print version -V, --version Print version
SGE OPTIONS: SGE OPTIONS:
--sge Run MetaCLADE jobs on a SGE HPC environment --sge Run MetaCLADE2 jobs on a SGE HPC environment
--pe <name> Parallel environment to use (mandatory) --pe <name> Parallel environment to use (mandatory)
--queue <name> Name of a specific queue where jobs are submitted --queue <name> Name of a specific queue where jobs are submitted
--time-limit <hh:mm:ss> Time limit for submitted jobs formatted as hh:mm:ss --time-limit <hh:mm:ss> Time limit for submitted jobs formatted as hh:mm:ss
...@@ -78,8 +81,13 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory. ...@@ -78,8 +81,13 @@ where `[MetaCLADE2_DIR]` is MetaCLADE2 installation directory.
(e.g., use --time-limit 2:30:00 for setting a limit of 2h and 30m) (e.g., use --time-limit 2:30:00 for setting a limit of 2h and 30m)
``` ```
Scripts and computation results are stored in `[WORKING_DIR]/[DATASET_NAME]`. By default `[WORKING_DIR]` is the current directory (the one from which `metaclade2` command is run).
It is possible to change this path with the `-W|--work-dir` argument.
Finally, intermediate files can be deleted after a successful execution with the `--remove-temp` option.
#### Optional MetaCLADE2 configuration file (available soon) #### Optional MetaCLADE2 configuration file (available soon)
MetaCLADE2 optionally accepts a configuration file that allows the user to set custom paths to the MetaCLADE model library. MetaCLADE2 optionally accepts a configuration file that allows the user to set custom paths to the MetaCLADE2 model library.
Lines starting with a semicolon are not taken into account and are considered as comments. Lines starting with a semicolon are not taken into account and are considered as comments.
You **must** also provide absolute paths. You **must** also provide absolute paths.
``` ```
...@@ -88,37 +96,9 @@ You **must** also provide absolute paths. ...@@ -88,37 +96,9 @@ You **must** also provide absolute paths.
;hmms_path = /absolute/path/to/data/models/HMMs ;hmms_path = /absolute/path/to/data/models/HMMs
``` ```
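The comment and absolute-path rules above can be checked with a short Python sketch. It assumes the file uses plain `key = value` lines and that every value is a path; the full format is not shown in this diff, so the `read_config` helper, the handling of section headers, and the `/opt/metaclade2/...` path are illustrative assumptions rather than MetaCLADE2's actual behaviour.

```python
import posixpath


def read_config(text: str) -> dict:
    """Collect key/value settings, skipping comments and rejecting relative paths."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and lines starting with a semicolon are ignored.
        # Skipping "[section]" headers is an assumption about the format.
        if not line or line.startswith(";") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        key, value = key.strip(), value.strip()
        # The README requires absolute paths (Linux-style, hence posixpath).
        if not posixpath.isabs(value):
            raise ValueError(f"{key}: '{value}' is not an absolute path")
        settings[key] = value
    return settings


# The first line is the commented example from the README;
# the second uses a made-up path for illustration.
sample = (
    ";hmms_path = /absolute/path/to/data/models/HMMs\n"
    "hmms_path = /opt/metaclade2/models/HMMs\n"
)
config = read_config(sample)
```

Running the sketch on a file with a relative path raises a `ValueError`, which makes the absolute-path requirement visible before a long job is launched.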
# MetaCLADE2 output architecture
# MetaCLADE jobs The domain architecture for the sequences provided in input is saved as a TSV file to `[WORKING_DIR]/[DATASET_NAME]/[DATASET_NAME].arch.tsv` (or to the path specified with the `-o|--output` argument).
By default jobs are created in `[WORKING_DIR]/[DATASET_NAME]/jobs/`. By default `[WORKING_DIR]` is the current directory where the `metaclade2` command has been run. Each line represents a domain annotation and has the following fields/columns:
Using the `--sge` parameter it is possible to automatically handle MetaCLADE2 pipeline in a SGE-based cluster (see [MetaCLADE2 usage](#metaclade2-usage) section).
Each (numbered) folder in this directory represents a step of the pipeline and contains several `*.sh` files (depending on the value provided with the `-j [NUMBER_OF_JOBS]` parameter):
```
[DATASET_NAME]_1.sh
[DATASET_NAME]_2.sh
...
[DATASET_NAME]_[NUMBER_OF_JOBS].sh
```
Jobs **must** be run in the following order:
```
[WORKING_DIR]/[DATASET_NAME]/jobs/1_search/
[WORKING_DIR]/[DATASET_NAME]/jobs/2_filter/
[WORKING_DIR]/[DATASET_NAME]/jobs/3_arch/
```
Each file in a given directory can be submitted independently to the HPC environment.
# MetaCLADE2 results
By default results are stored in the `[WORKING_DIR]/[DATASET_NAME]/results/` directory.
Each (numbered) folder in this directory contains the results after each step of the pipeline.
After running each step, the final annotation is saved in the file named
```
[WORKING_DIR]/[DATASET_NAME]/results/3_arch/[DATASET_NAME].arch.txt
```
It is a tab-separated values (TSV) file whose lines represent annotations.
Each annotation has the following fields:
* Sequence identifier * Sequence identifier
* Sequence start * Sequence start
* Sequence end * Sequence end
...@@ -131,32 +111,30 @@ Each annotation has the following fields: ...@@ -131,32 +111,30 @@ Each annotation has the following fields:
* E-value of the prediction * E-value of the prediction
* Bitscore of the prediction * Bitscore of the prediction
* Accuracy value in the interval [0,1] * Accuracy value in the interval [0,1]
* Species of the template used to build the model
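
The field list above maps onto a small reader for the annotation file. The following Python sketch assumes 13 tab-separated columns in the order seen in the example output. The names given to the columns between "Sequence end" and "E-value of the prediction" are guesses inferred from the example rows, since those fields are hidden in this diff; `Annotation` and `parse_annotations` are hypothetical names, not part of MetaCLADE2.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Annotation(NamedTuple):
    seq_id: str
    seq_start: int
    seq_end: int
    seq_length: int    # assumed name (field not visible in the diff)
    domain_id: str     # assumed name, e.g. a Pfam accession
    model_id: str      # assumed name
    model_start: int   # assumed name
    model_end: int     # assumed name
    model_length: int  # assumed name
    evalue: float
    bitscore: float
    accuracy: float
    species: str


def parse_annotations(handle) -> Iterator[Annotation]:
    """Yield one Annotation per non-empty tab-separated line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        yield Annotation(
            row[0], int(row[1]), int(row[2]), int(row[3]),
            row[4], row[5], int(row[6]), int(row[7]), int(row[8]),
            float(row[9]), float(row[10]), float(row[11]), row[12],
        )


# One row from the example output, with tabs restored between fields.
example = (
    "tr|V7B5W0|V7B5W0_PHAVU\t285\t476\t682\tPF03441\tE4X2Z2_OIKDI_300-507\t"
    "4\t193\t196\t3.3e-71\t226.4\t0.97\tOikopleura dioica\n"
)
records = list(parse_annotations(io.StringIO(example)))
```

In practice the handle would be the opened `[DATASET_NAME].arch.tsv` file rather than an in-memory string.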
# Example # Example
A test dataset is available in the `test` directory and can be run with the following command: A test dataset is available in the `test` directory and can be run, using 4 threads, with the following command:
``` ```
cd [METACLADE2_DIR] cd [METACLADE2_DIR]/test
metaclade2 -i ./test/test.fa -N testDataSet -d PF00875,PF03441,PF03167,PF12546 -W ./test -j 2 metaclade2 -i test.fa -N output -d PF00875,PF03441,PF03167,PF12546 -t 4
``` ```
This will create at most two scripts (jobs) in each directory of the pipeline. Alternatively, in a SGE-based cluster, the following command will run MetaCLADE2 submitting at most 2 jobs, each one using 4 CPUs, for each step of the pipeline:
Alternatively, if you are running MetaCLADE2 in a SGE cluster, the following script will run at most 2 jobs, each one using 2 CPUs, for each step of the pipeline:
``` ```
cd [METACLADE2_DIR] cd [METACLADE2_DIR]/test
metaclade2 -i ./test/test.fa -N testDataSet -d PF00875,PF03441,PF03167,PF12546 -W ./test --sge --pe smp -j 2 -t 2 metaclade2 -i test.fa -N output -d PF00875,PF03441,PF03167,PF12546 --sge --pe smp -j 2 -t 4
``` ```
Results will be stored in the `[METACLADE2_DIR]/test/testDataSet/results` directory. Resulting annotation will be saved in the `[METACLADE2_DIR]/test/output/output.arch.tsv` file and it should look as follows:
The final annotation file should look as follows:
``` ```
tr|V7B5W0|V7B5W0_PHAVU 285 476 682 PF03441 E4X2Z2_OIKDI_300-507 4 193 196 3.3e-71 226.4 0.97 Oikopleura dioica tr|A0A072NB93|A0A072NB93_9DEIO 12 141 766 PF00875 A6FVP5_9RHOB_1-137 1 129 130 8.2e-36 111.3 0.97 Roseobacter sp. AzwK-3b
tr|A0A072NB93|A0A072NB93_9DEIO 274 469 766 PF03441 HMMer-3 1 199 202 6.9e-56 176.6 0.91 unavailable
tr|A0A072NB93|A0A072NB93_9DEIO 591 753 766 PF03167 A0A0G0AUB5_9BACT_52-217 1 162 163 5.6e-75 238.8 0.99 Candidatus Roizmanbacteria bacterium GW2011_GWC2_34_23
tr|F0RPZ8|F0RPZ8_DEIPM 5 127 757 PF00875 K8GNY4_9CYAN_4-128 1 119 122 1.7e-35 110.2 0.97 Oscillatoriales cyanobacterium JSC-12
tr|F0RPZ8|F0RPZ8_DEIPM 267 461 757 PF03441 A0A0F3K8Y1_9NEIS_266-465 1 188 193 4e-64 203.4 0.97 Aquitalea magnusonii
tr|F0RPZ8|F0RPZ8_DEIPM 586 748 757 PF03167 A0A0G0AUB5_9BACT_52-217 2 162 163 7.8e-72 228.5 0.98 Candidatus Roizmanbacteria bacterium GW2011_GWC2_34_23
tr|V7B5W0|V7B5W0_PHAVU 7 173 682 PF00875 HMMer-3 1 157 164 2.5e-44 138.9 0.93 unavailable tr|V7B5W0|V7B5W0_PHAVU 7 173 682 PF00875 HMMer-3 1 157 164 2.5e-44 138.9 0.93 unavailable
tr|V7B5W0|V7B5W0_PHAVU 285 476 682 PF03441 E4X2Z2_OIKDI_300-507 4 193 196 3.3e-71 226.4 0.97 Oikopleura dioica
tr|V7B5W0|V7B5W0_PHAVU 505 645 682 PF12546 S8D414_9LAMI_84-208 1 138 139 2.9e-45 142.1 0.82 Genlisea aurea tr|V7B5W0|V7B5W0_PHAVU 505 645 682 PF12546 S8D414_9LAMI_84-208 1 138 139 2.9e-45 142.1 0.82 Genlisea aurea
tr|F0RPZ8|F0RPZ8_DEIPM 586 748 757 PF03167 A0A0G0AUB5_9BACT_52-217 2 162 163 7.8e-72 228.5 0.98 Candidatus Roizmanbacteria bacterium GW2011_GWC2_34_23
tr|F0RPZ8|F0RPZ8_DEIPM 267 461 757 PF03441 A0A0F3K8Y1_9NEIS_266-465 1 188 193 4e-64 203.4 0.97 Aquitalea magnusonii
tr|F0RPZ8|F0RPZ8_DEIPM 5 127 757 PF00875 K8GNY4_9CYAN_4-128 1 119 122 1.7e-35 110.2 0.97 Oscillatoriales cyanobacterium JSC-12
tr|A0A072NB93|A0A072NB93_9DEIO 591 753 766 PF03167 A0A0G0AUB5_9BACT_52-217 1 162 163 5.6e-75 238.8 0.99 Candidatus Roizmanbacteria bacterium GW2011_GWC2_34_23
tr|A0A072NB93|A0A072NB93_9DEIO 274 469 766 PF03441 HMMer-3 1 199 202 6.9e-56 176.6 0.91 unavailable
tr|A0A072NB93|A0A072NB93_9DEIO 12 141 766 PF00875 A6FVP5_9RHOB_1-137 1 129 130 8.2e-36 111.3 0.97 Roseobacter sp. AzwK-3b
``` ```