Software requirements
---------------------
+ HMMER-3
+ DAMA
+ GNU parallel (optional but recommended for running jobs on multiple threads)
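To quickly verify that these tools are reachable from your shell, something like the following can be used (the executable names are assumptions: HMMER ships `hmmsearch`, while DAMA's binary name may vary by installation):
```
# Check for required executables on the PATH (names assumed)
for tool in hmmsearch parallel; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```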
Installation
------------
The latest development version of MetaCLADE2 can be obtained by running the following command:
```
git clone http://gitlab.lcqb.upmc.fr/vicedomini/metaclade2.git
```
Then, it is advised to include the MetaCLADE2 directory in your PATH environment variable by adding the following line to your `~/.bashrc` file:
```
export PATH="[MetaCLADE_DIR]:${PATH}"
```
where `[MetaCLADE_DIR]` is MetaCLADE2's installation directory.
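For example, if the repository has been cloned into your home directory (a hypothetical path; adjust it to your setup):
```
export PATH="$HOME/metaclade2:${PATH}"
```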
MetaCLADE usage
---------------
```
USAGE: metaclade2 -i <input_fasta> -N <name> [options]

MANDATORY OPTIONS:
  -i, --input <path>   Input file of AA sequences in FASTA format
                       (protein sequences or predicted CDS)
  -N, --name <str>     Dataset/job name

MetaCLADE OPTIONS:
  ...
  (e.g., use --time-limit 2:30:00 for setting a limit of 2h and 30m)
```
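For instance, a minimal run only needs the two mandatory options (the input file name below is a placeholder):
```
metaclade2 -i proteins.faa -N myrun
```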
#### MetaCLADE configuration file example (optional)
Optionally, a MetaCLADE configuration file can be provided to `metaclade2` with the parameter `--metaclade-cfg`.
This file can be used to set custom paths to the PSI-BLAST/HMMER/Python executables or to the MetaCLADE model library.
Lines starting with a semicolon are not taken into account. Also, you should provide absolute paths.
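The actual example file is not reproduced in this excerpt; as a purely illustrative sketch using an INI-like `[Section]`/`key = value` layout with semicolon comments, it might look like the following (all key names here are hypothetical, not the real parameter names):
```
; Illustrative sketch only -- key names are hypothetical
[Paths]
HMMER_BIN = /usr/local/bin
PYTHON_BIN = /usr/bin/python3
MODEL_LIBRARY_DIR = /path/to/metaclade2/data/models
```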
### MetaCLADE jobs
By default, jobs are created in `[WORKING_DIR]/[DATASET_NAME]/jobs/`, where `[WORKING_DIR]` defaults to the directory from which the `metaclade2` command has been run.
Each (numbered) folder in this directory represents a step of the pipeline and contains several `*.sh` files (their number depends on the value provided with the `-j [NUMBER_OF_JOBS]` parameter):
```
[DATASET_NAME]_0.sh
[DATASET_NAME]_1.sh
[DATASET_NAME]_2.sh
...
```
Jobs **must** be run in the following order:
```
[WORKING_DIR]/[DATASET_NAME]/jobs/1_search/
[WORKING_DIR]/[DATASET_NAME]/jobs/2_filter/
[WORKING_DIR]/[DATASET_NAME]/jobs/3_arch/
```
Each file in a given directory can be submitted independently to the HPC environment.
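For example, on a single multi-core machine, all scripts of one step could be executed with GNU parallel (a sketch, assuming the job scripts are self-contained shell scripts; adjust the path and the number of concurrent jobs to your run):
```
# Run all step-1 job scripts, at most 4 at a time
parallel -j 4 bash {} ::: [WORKING_DIR]/[DATASET_NAME]/jobs/1_search/*.sh
```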
### MetaCLADE2 results
By default, results are stored in the `[WORKING_DIR]/[DATASET_NAME]/results/` directory.
Each (numbered) folder in this directory contains the results of the corresponding step of the pipeline.
After all steps have been run, the final annotation is saved in
```
[WORKING_DIR]/[DATASET_NAME]/results/3_arch/
```
It is a tab-separated values (TSV) file whose lines represent annotations.
Each annotation has the following fields:
* Sequence identifier
* Sequence start
* Sequence end
* Sequence length
* Domain identifier (_i.e._, Pfam accession number)
* Model identifier
* Model start
* Model end
* Model size
* E-value of the prediction
* Bitscore of the prediction
* Accuracy value in the interval [0,1]
* Prediction probability
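Assuming the columns appear exactly in the order listed above, simple shell tools can post-filter the annotations; for instance (the file name below is a placeholder for the actual TSV produced in `3_arch/`):
```
# Keep annotations with an E-value below 1e-5 (E-value is column 10 in the order above)
awk -F'\t' '$10 < 1e-5' [WORKING_DIR]/[DATASET_NAME]/results/3_arch/annotations.tsv
```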
Example
-------
Example command:
```
metaclade2 -i ./test/test.fa -N pippo -d PF00875,PF03441,PF03167,PF12546 -W ./test/ --arch --sge --pe smp -j 2 -t 2
```
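On a machine without a grid engine, the same test could presumably be run by dropping the scheduler-specific flags (`--sge` and `--pe smp` appear to target SGE environments; this variant is an untested sketch):
```
metaclade2 -i ./test/test.fa -N pippo -d PF00875,PF03441,PF03167,PF12546 -W ./test/ --arch -j 2 -t 2
```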