#### A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling
Biochemical and regulatory pathways have traditionally been studied and modelled within a single cell type, organism, or species. This view is being dramatically changed by the advent of whole-microbiome sequencing studies, which reveal the role of symbiotic microbial populations in fundamental biochemical functions. This new landscape requires reconstructing biochemical and regulatory pathways at the community level in a given environment. To understand how environmental factors affect the genetic material and the dynamics of expression from one environment to another, one needs to relate genetic information to these factors quantitatively. For this, the quantity of protein sequences or transcripts associated with a given pathway must be evaluated as precisely as possible: we wish to estimate the abundance of protein domains accurately, but also to recognise their weak presence or absence.
We introduce MetaCLADE2, an improved profile-based domain annotation pipeline based on the multi-source domain annotation strategy. It annotates domains directly from reads and achieves an improved identification of the catalog of functions in a microbiome. MetaCLADE2 can be applied to metagenomic or metatranscriptomic datasets as well as to proteomes.
System requirements
-------------------
+ MetaCLADE has been developed under a Unix environment.
+ The bash environment should be installed.
+ Python 3 is required for this package.
Software requirements
---------------------
+ PSI-BLAST
+ HMMer-3.0
CLADE's model library
---------------------
In order to run MetaCLADE, CLADE's library must be downloaded from [here](http://134.157.11.245/CLADE/deploy/models/).
Let `[MetaCLADE_DIR]` be the directory of MetaCLADE. The library should be extracted in the following two directories:
```
MANDATORY OPTIONS:
-i, --input <path> Input file of AA sequences in FASTA format
-N, --name <str> Dataset/job name
MetaCLADE OPTIONS:
-a, --arch Use DAMA to properly compute domain architectures
(useful only for long protein sequences)
-d, --domain-list <str> Comma-separated list of Pfam accession numbers of
the domains to be considered in the analysis
(e.g., "PF00875,PF03441")
-D, --domain-file <path> File that contains the Pfam accession numbers
of the domains to be considered (one per line)
-e, --evalue-cutoff <float> E-value cutoff
-E, --evalue-cutconf <float> Confidence threshold used by DAMA to add new domains into the architecture.
-W, --work-dir <path> Working directory, where jobs and results are saved
OTHER OPTIONS:
-j, --jobs <num> Number of jobs to be created (default:16)
-t, --threads <num> Number of threads for each job (default:1)
-h, --help Print this help message
-V, --version Print version
SGE OPTIONS:
--sge Run MetaCLADE jobs on a SGE HPC environment
--pe <name> Parallel environment to use (mandatory)
--queue <name> Name of a specific queue where jobs are submitted
--time-limit <hh:mm:ss> Time limit for submitted jobs formatted as hh:mm:ss
where hh, mm, ss represent hours, minutes, and seconds respectively
(e.g., use --time-limit 2:30:00 for setting a limit of 2h and 30m)
```
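For illustration, an annotation run restricted to two Pfam domains, split into 32 jobs of 2 threads each, could then be launched as follows (the input file name and dataset name are placeholders):
```
metaclade -i sample_reads.faa -N gut_sample -d "PF00875,PF03441" -j 32 -t 2
```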
### 1. MetaCLADE configuration
First of all, it is advised to add MetaCLADE's main directory (if it is not already present) to your `PATH` environment variable by adding the following line to your `~/.bashrc`:
```
export PATH="[MetaCLADE_DIR]:${PATH}"
```
where `[MetaCLADE_DIR]` is MetaCLADE's installation directory.
Then, in order to create MetaCLADE jobs you must first create a *Run configuration file* (see below) and run the following command:
```
metaclade --run-cfg [Run configuration file]
```
#### Input file preprocessing
Before running MetaCLADE on the input FASTA file you should build a BLAST database.
You can either set the CREATE_BLASTDB parameter to True in the Run configuration file (see below) or you can manually run the following command:
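If you choose the manual route, the standard BLAST+ command for building a protein database is `makeblastdb`; the input file name below is a placeholder:
```
makeblastdb -in sample_reads.faa -dbtype prot
```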
A custom working directory (where jobs and results are saved) can be set with the `WORKING_DIR` parameter (by default, the directory from which the `metaclade` command has been called).
A custom temporary directory can be set with the `TMP_DIR` parameter (by default, a temp subdirectory of the working directory).
If you want to restrict MetaCLADE's annotation to a subset of domains, you can provide a file containing one domain identifier per line via the `DOMAINS_DIR` parameter.
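As an illustrative sketch only: the parameters described above could be combined in a Run configuration file along these lines. The key/value syntax is assumed to match the MetaCLADE configuration file format shown further below, and all paths are placeholders:
```
CREATE_BLASTDB = True
WORKING_DIR = /home/user/metaclade_runs
TMP_DIR = /home/user/metaclade_runs/temp
DOMAINS_DIR = /home/user/my_domains.txt
```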
#### MetaCLADE configuration file example (optional)
Optionally, a MetaCLADE configuration file can be provided to `metaclade` with the parameter `--metaclade-cfg`.
This file can be used to set custom paths to the PSI-BLAST/HMMER/Python executables or to the MetaCLADE model library.
Lines starting with a semicolon are treated as comments and ignored. Note that you should provide absolute paths.
```
[Programs]
;PSIBLAST_DIR = /home/ncbi-blast-2.7.1+/bin/
;HMMER_DIR = /home/hmmer-3.2.1/bin/
;PYTHON_DIR = /home/python-3.7.0/bin/
[Models]
;PSSMS_DIR = /home/MetaCLADE/data/models/pssms
;HMMS_DIR = /home/MetaCLADE/data/models/hmms
```
### 2. MetaCLADE jobs
By default jobs are created in `[WORKING_DIR]/[DATASET_NAME]/jobs/`.
Each (numbered) folder in this directory represents a step of the pipeline and contains several `*.sh` files (their number depends on the value assigned to the `NUMBER_OF_JOBS` parameter).
In the first three directories you can find a `submit.sh` file that contains the `qsub` command to submit each job to the queue system of a SGE environment.
This file can be used (or adapted for other HPC environments) in order to submit all jobs at each step.
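For example, on an SGE cluster the per-step `submit.sh` scripts could be launched as follows; note that each step should complete before the next one is started (the paths follow the layout described above):
```
# Submit all jobs of step 1; repeat for the following steps
# once the previous step has finished.
cd [WORKING_DIR]/[DATASET_NAME]/jobs/1
bash submit.sh
```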
### 3. MetaCLADE results
By default results are stored in the `[WORKING_DIR]/[DATASET_NAME]/results/` directory.
Each (numbered) folder in this directory contains the results after each step of the pipeline.
After running each step, the final annotation is saved in the file