Software requirements
---------------------
+ HMMER-3
+ DAMA
+ GNU parallel (optional but recommended for running jobs on multiple threads)
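To quickly verify that these tools are reachable from your shell, something like the following can be used (the executable names are assumptions: HMMER ships `hmmsearch`, while DAMA's binary name may vary by installation):
```
# Check for required executables on the PATH (names assumed)
for tool in hmmsearch parallel; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```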
Installation
------------
The latest development version of MetaCLADE2 can be obtained by running the following command:
```
git clone http://gitlab.lcqb.upmc.fr/vicedomini/metaclade2.git
```
Then, it is advised to include the MetaCLADE2 directory in your PATH environment variable by adding the following line to your `~/.bashrc` file:
```
export PATH="[MetaCLADE_DIR]:${PATH}"
```
where `[MetaCLADE_DIR]` is MetaCLADE2's installation directory.
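For example, if the repository has been cloned into your home directory (a hypothetical path; adjust it to your setup):
```
export PATH="$HOME/metaclade2:${PATH}"
```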
MetaCLADE usage
---------------
```
USAGE: metaclade2 -i <input_fasta> -N <name> [options]

MANDATORY OPTIONS:
  -i, --input <path>   Input file of AA sequences in FASTA format
                       (protein sequences or predicted CDS)
  -N, --name <str>     Dataset/job name

MetaCLADE OPTIONS:
  ...
  (e.g., use --time-limit 2:30:00 for setting a limit of 2h and 30m)
```
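For instance, a minimal run only needs the two mandatory options (the input file name below is a placeholder):
```
metaclade2 -i proteins.faa -N myrun
```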
#### MetaCLADE configuration file example (optional)
Optionally, a MetaCLADE configuration file can be provided to `metaclade2` with the parameter `--metaclade-cfg`.
This file can be used to set custom paths to the PSI-BLAST/HMMER/Python executables or to the MetaCLADE model library.
Lines starting with a semicolon are not taken into account. Also, you should provide absolute paths.
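The actual example file is not reproduced in this excerpt; as a purely illustrative sketch using an INI-like `[Section]`/`key = value` layout with semicolon comments, it might look like the following (all key names here are hypothetical, not the real parameter names):
```
; Illustrative sketch only -- key names are hypothetical
[Paths]
HMMER_BIN = /usr/local/bin
PYTHON_BIN = /usr/bin/python3
MODEL_LIBRARY_DIR = /path/to/metaclade2/data/models
```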
### MetaCLADE jobs
By default, jobs are created in `[WORKING_DIR]/[DATASET_NAME]/jobs/`, where `[WORKING_DIR]` defaults to the directory from which the `metaclade2` command has been run.
Each (numbered) folder in this directory represents a step of the pipeline and contains several `*.sh` files (their number depends on the value provided with the `-j [NUMBER_OF_JOBS]` parameter):
```
[DATASET_NAME]_0.sh
[DATASET_NAME]_1.sh
[DATASET_NAME]_2.sh
...
```
Jobs **must** be run in the following order:
```
[WORKING_DIR]/[DATASET_NAME]/jobs/1_search/
[WORKING_DIR]/[DATASET_NAME]/jobs/2_filter/
[WORKING_DIR]/[DATASET_NAME]/jobs/3_arch/
```
Each file in a given directory can be submitted independently to the HPC environment.
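For example, on a single multi-core machine, all scripts of one step could be executed with GNU parallel (a sketch, assuming the job scripts are self-contained shell scripts; adjust the path and the number of concurrent jobs to your run):
```
# Run all step-1 job scripts, at most 4 at a time
parallel -j 4 bash {} ::: [WORKING_DIR]/[DATASET_NAME]/jobs/1_search/*.sh
```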
### MetaCLADE2 results
By default, results are stored in the `[WORKING_DIR]/[DATASET_NAME]/results/` directory.
Each (numbered) folder in this directory contains the results of the corresponding step of the pipeline.
After all steps have been run, the final annotation is saved in
```
[WORKING_DIR]/[DATASET_NAME]/results/3_arch/
```
It is a tab-separated values (TSV) file whose lines represent annotations.
Each annotation has the following fields:
* Sequence identifier
* Sequence start
* Sequence end
* Sequence length
* Domain identifier (_i.e._, Pfam accession number)
* Model identifier
* Model start
* Model end
* Model size
* E-value of the prediction
* Bitscore of the prediction
* Accuracy value in the interval [0,1]
* Prediction probability
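Assuming the columns appear exactly in the order listed above, simple shell tools can post-filter the annotations; for instance (the file name below is a placeholder for the actual TSV produced in `3_arch/`):
```
# Keep annotations with an E-value below 1e-5 (E-value is column 10 in the order above)
awk -F'\t' '$10 < 1e-5' [WORKING_DIR]/[DATASET_NAME]/results/3_arch/annotations.tsv
```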
Example
-------
Example command:
```
metaclade2 -i ./test/test.fa -N pippo -d PF00875,PF03441,PF03167,PF12546 -W ./test/ --arch --sge --pe smp -j 2 -t 2
```
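On a machine without a grid engine, the same test could presumably be run by dropping the scheduler-specific flags (`--sge` and `--pe smp` appear to target SGE environments; this variant is an untested sketch):
```
metaclade2 -i ./test/test.fa -N pippo -d PF00875,PF03441,PF03167,PF12546 -W ./test/ --arch -j 2 -t 2
```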