Commit ed79a631 by vice1987

First commit

Riccardo Vicedomini
Jean-Pierre Bouly
Soizic Cheminant-Navarro
Elodie Laine
Angela Falciatore
Alessandra Carbone
CeCILL FREE SOFTWARE LICENSE AGREEMENT
Version 2.1 dated 2013-06-21
Notice
This Agreement is a Free Software license agreement that is the result
of discussions between its authors in order to ensure compliance with
the two main principles guiding its drafting:
* firstly, compliance with the principles governing the distribution
of Free Software: access to source code, broad rights granted to users,
* secondly, the election of a governing law, French law, with which it
is conformant, both as regards the law of torts and intellectual
property law, and the protection that it offers to both authors and
holders of the economic rights over software.
The authors of the CeCILL (for Ce[a] C[nrs] I[nria] L[ogiciel] L[ibre])
license are:
Commissariat à l'énergie atomique et aux énergies alternatives - CEA, a
public scientific, technical and industrial research establishment,
having its principal place of business at 25 rue Leblanc, immeuble Le
Ponant D, 75015 Paris, France.
Centre National de la Recherche Scientifique - CNRS, a public scientific
and technological establishment, having its principal place of business
at 3 rue Michel-Ange, 75794 Paris cedex 16, France.
Institut National de Recherche en Informatique et en Automatique -
Inria, a public scientific and technological establishment, having its
principal place of business at Domaine de Voluceau, Rocquencourt, BP
105, 78153 Le Chesnay cedex, France.
Preamble
The purpose of this Free Software license agreement is to grant users
the right to modify and redistribute the software governed by this
license within the framework of an open source distribution model.
The exercising of this right is conditional upon certain obligations for
users so as to preserve this status for all subsequent redistributions.
In consideration of access to the source code and the rights to copy,
modify and redistribute granted by the license, users are provided only
with a limited warranty and the software's author, the holder of the
economic rights, and the successive licensors only have limited liability.
In this respect, the risks associated with loading, using, modifying
and/or developing or reproducing the software by the user are brought to
the user's attention, given its Free Software status, which may make it
complicated to use, with the result that its use is reserved for
developers and experienced professionals having in-depth computer
knowledge. Users are therefore encouraged to load and test the
suitability of the software as regards their requirements in conditions
enabling the security of their systems and/or data to be ensured and,
more generally, to use and operate it in the same conditions of
security. This Agreement may be freely reproduced and published,
provided it is not altered, and that no provisions are either added or
removed herefrom.
This Agreement may apply to any or all software for which the holder of
the economic rights decides to submit the use thereof to its provisions.
Frequently asked questions can be found on the official website of the
CeCILL licenses family (http://www.cecill.info/index.en.html) for any
necessary clarification.
Article 1 - DEFINITIONS
For the purpose of this Agreement, when the following expressions
commence with a capital letter, they shall have the following meaning:
Agreement: means this license agreement, and its possible subsequent
versions and annexes.
Software: means the software in its Object Code and/or Source Code form
and, where applicable, its documentation, "as is" when the Licensee
accepts the Agreement.
Initial Software: means the Software in its Source Code and possibly its
Object Code form and, where applicable, its documentation, "as is" when
it is first distributed under the terms and conditions of the Agreement.
Modified Software: means the Software modified by at least one
Contribution.
Source Code: means all the Software's instructions and program lines to
which access is required so as to modify the Software.
Object Code: means the binary files originating from the compilation of
the Source Code.
Holder: means the holder(s) of the economic rights over the Initial
Software.
Licensee: means the Software user(s) having accepted the Agreement.
Contributor: means a Licensee having made at least one Contribution.
Licensor: means the Holder, or any other individual or legal entity, who
distributes the Software under the Agreement.
Contribution: means any or all modifications, corrections, translations,
adaptations and/or new functions integrated into the Software by any or
all Contributors, as well as any or all Internal Modules.
Module: means a set of sources files including their documentation that
enables supplementary functions or services in addition to those offered
by the Software.
External Module: means any or all Modules, not derived from the
Software, so that this Module and the Software run in separate address
spaces, with one calling the other when they are run.
Internal Module: means any or all Module, connected to the Software so
that they both execute in the same address space.
GNU GPL: means the GNU General Public License version 2 or any
subsequent version, as published by the Free Software Foundation Inc.
GNU Affero GPL: means the GNU Affero General Public License version 3 or
any subsequent version, as published by the Free Software Foundation Inc.
EUPL: means the European Union Public License version 1.1 or any
subsequent version, as published by the European Commission.
Parties: mean both the Licensee and the Licensor.
These expressions may be used both in singular and plural form.
Article 2 - PURPOSE
The purpose of the Agreement is the grant by the Licensor to the
Licensee of a non-exclusive, transferable and worldwide license for the
Software as set forth in Article 5 <#scope> hereinafter for the whole
term of the protection granted by the rights over said Software.
Article 3 - ACCEPTANCE
3.1 The Licensee shall be deemed as having accepted the terms and
conditions of this Agreement upon the occurrence of the first of the
following events:
* (i) loading the Software by any or all means, notably, by
downloading from a remote server, or by loading from a physical medium;
* (ii) the first time the Licensee exercises any of the rights granted
hereunder.
3.2 One copy of the Agreement, containing a notice relating to the
characteristics of the Software, to the limited warranty, and to the
fact that its use is restricted to experienced users has been provided
to the Licensee prior to its acceptance as set forth in Article 3.1
<#accepting> hereinabove, and the Licensee hereby acknowledges that it
has read and understood it.
Article 4 - EFFECTIVE DATE AND TERM
4.1 EFFECTIVE DATE
The Agreement shall become effective on the date when it is accepted by
the Licensee as set forth in Article 3.1 <#accepting>.
4.2 TERM
The Agreement shall remain in force for the entire legal term of
protection of the economic rights over the Software.
Article 5 - SCOPE OF RIGHTS GRANTED
The Licensor hereby grants to the Licensee, who accepts, the following
rights over the Software for any or all use, and for the term of the
Agreement, on the basis of the terms and conditions set forth hereinafter.
Besides, if the Licensor owns or comes to own one or more patents
protecting all or part of the functions of the Software or of its
components, the Licensor undertakes not to enforce the rights granted by
these patents against successive Licensees using, exploiting or
modifying the Software. If these patents are transferred, the Licensor
undertakes to have the transferees subscribe to the obligations set
forth in this paragraph.
5.1 RIGHT OF USE
The Licensee is authorized to use the Software, without any limitation
as to its fields of application, with it being hereinafter specified
that this comprises:
1. permanent or temporary reproduction of all or part of the Software
by any or all means and in any or all form.
2. loading, displaying, running, or storing the Software on any or all
medium.
3. entitlement to observe, study or test its operation so as to
determine the ideas and principles behind any or all constituent
elements of said Software. This shall apply when the Licensee
carries out any or all loading, displaying, running, transmission or
storage operation as regards the Software, that it is entitled to
carry out hereunder.
5.2 ENTITLEMENT TO MAKE CONTRIBUTIONS
The right to make Contributions includes the right to translate, adapt,
arrange, or make any or all modifications to the Software, and the right
to reproduce the resulting software.
The Licensee is authorized to make any or all Contributions to the
Software provided that it includes an explicit notice that it is the
author of said Contribution and indicates the date of the creation thereof.
5.3 RIGHT OF DISTRIBUTION
In particular, the right of distribution includes the right to publish,
transmit and communicate the Software to the general public on any or
all medium, and by any or all means, and the right to market, either in
consideration of a fee, or free of charge, one or more copies of the
Software by any means.
The Licensee is further authorized to distribute copies of the modified
or unmodified Software to third parties according to the terms and
conditions set forth hereinafter.
5.3.1 DISTRIBUTION OF SOFTWARE WITHOUT MODIFICATION
The Licensee is authorized to distribute true copies of the Software in
Source Code or Object Code form, provided that said distribution
complies with all the provisions of the Agreement and is accompanied by:
1. a copy of the Agreement,
2. a notice relating to the limitation of both the Licensor's warranty
and liability as set forth in Articles 8 and 9,
and that, in the event that only the Object Code of the Software is
redistributed, the Licensee allows effective access to the full Source
Code of the Software for a period of at least three years from the
distribution of the Software, it being understood that the additional
acquisition cost of the Source Code shall not exceed the cost of the
data transfer.
5.3.2 DISTRIBUTION OF MODIFIED SOFTWARE
When the Licensee makes a Contribution to the Software, the terms and
conditions for the distribution of the resulting Modified Software
become subject to all the provisions of this Agreement.
The Licensee is authorized to distribute the Modified Software, in
source code or object code form, provided that said distribution
complies with all the provisions of the Agreement and is accompanied by:
1. a copy of the Agreement,
2. a notice relating to the limitation of both the Licensor's warranty
and liability as set forth in Articles 8 and 9,
and, in the event that only the object code of the Modified Software is
redistributed,
3. a note stating the conditions of effective access to the full source
code of the Modified Software for a period of at least three years
from the distribution of the Modified Software, it being understood
that the additional acquisition cost of the source code shall not
exceed the cost of the data transfer.
5.3.3 DISTRIBUTION OF EXTERNAL MODULES
When the Licensee has developed an External Module, the terms and
conditions of this Agreement do not apply to said External Module, that
may be distributed under a separate license agreement.
5.3.4 COMPATIBILITY WITH OTHER LICENSES
The Licensee can include a code that is subject to the provisions of one
of the versions of the GNU GPL, GNU Affero GPL and/or EUPL in the
Modified or unmodified Software, and distribute that entire code under
the terms of the same version of the GNU GPL, GNU Affero GPL and/or EUPL.
The Licensee can include the Modified or unmodified Software in a code
that is subject to the provisions of one of the versions of the GNU GPL,
GNU Affero GPL and/or EUPL and distribute that entire code under the
terms of the same version of the GNU GPL, GNU Affero GPL and/or EUPL.
Article 6 - INTELLECTUAL PROPERTY
6.1 OVER THE INITIAL SOFTWARE
The Holder owns the economic rights over the Initial Software. Any or
all use of the Initial Software is subject to compliance with the terms
and conditions under which the Holder has elected to distribute its work
and no one shall be entitled to modify the terms and conditions for the
distribution of said Initial Software.
The Holder undertakes that the Initial Software will remain ruled at
least by this Agreement, for the duration set forth in Article 4.2 <#term>.
6.2 OVER THE CONTRIBUTIONS
The Licensee who develops a Contribution is the owner of the
intellectual property rights over this Contribution as defined by
applicable law.
6.3 OVER THE EXTERNAL MODULES
The Licensee who develops an External Module is the owner of the
intellectual property rights over this External Module as defined by
applicable law and is free to choose the type of agreement that shall
govern its distribution.
6.4 JOINT PROVISIONS
The Licensee expressly undertakes:
1. not to remove, or modify, in any manner, the intellectual property
notices attached to the Software;
2. to reproduce said notices, in an identical manner, in the copies of
the Software modified or not.
The Licensee undertakes not to directly or indirectly infringe the
intellectual property rights on the Software of the Holder and/or
Contributors, and to take, where applicable, vis-à-vis its staff, any
and all measures required to ensure respect of said intellectual
property rights of the Holder and/or Contributors.
Article 7 - RELATED SERVICES
7.1 Under no circumstances shall the Agreement oblige the Licensor to
provide technical assistance or maintenance services for the Software.
However, the Licensor is entitled to offer this type of services. The
terms and conditions of such technical assistance, and/or such
maintenance, shall be set forth in a separate instrument. Only the
Licensor offering said maintenance and/or technical assistance services
shall incur liability therefor.
7.2 Similarly, any Licensor is entitled to offer to its licensees, under
its sole responsibility, a warranty, that shall only be binding upon
itself, for the redistribution of the Software and/or the Modified
Software, under terms and conditions that it is free to decide. Said
warranty, and the financial terms and conditions of its application,
shall be subject of a separate instrument executed between the Licensor
and the Licensee.
Article 8 - LIABILITY
8.1 Subject to the provisions of Article 8.2, the Licensee shall be
entitled to claim compensation for any direct loss it may have suffered
from the Software as a result of a fault on the part of the relevant
Licensor, subject to providing evidence thereof.
8.2 The Licensor's liability is limited to the commitments made under
this Agreement and shall not be incurred as a result of in particular:
(i) loss due the Licensee's total or partial failure to fulfill its
obligations, (ii) direct or consequential loss that is suffered by the
Licensee due to the use or performance of the Software, and (iii) more
generally, any consequential loss. In particular the Parties expressly
agree that any or all pecuniary or business loss (i.e. loss of data,
loss of profits, operating loss, loss of customers or orders,
opportunity cost, any disturbance to business activities) or any or all
legal proceedings instituted against the Licensee by a third party,
shall constitute consequential loss and shall not provide entitlement to
any or all compensation from the Licensor.
Article 9 - WARRANTY
9.1 The Licensee acknowledges that the scientific and technical
state-of-the-art when the Software was distributed did not enable all
possible uses to be tested and verified, nor for the presence of
possible defects to be detected. In this respect, the Licensee's
attention has been drawn to the risks associated with loading, using,
modifying and/or developing and reproducing the Software which are
reserved for experienced users.
The Licensee shall be responsible for verifying, by any or all means,
the suitability of the product for its requirements, its good working
order, and for ensuring that it shall not cause damage to either persons
or properties.
9.2 The Licensor hereby represents, in good faith, that it is entitled
to grant all the rights over the Software (including in particular the
rights set forth in Article 5 <#scope>).
9.3 The Licensee acknowledges that the Software is supplied "as is" by
the Licensor without any other express or tacit warranty, other than
that provided for in Article 9.2 <#good-faith> and, in particular,
without any warranty as to its commercial value, its secured, safe,
innovative or relevant nature.
Specifically, the Licensor does not warrant that the Software is free
from any error, that it will operate without interruption, that it will
be compatible with the Licensee's own equipment and software
configuration, nor that it will meet the Licensee's requirements.
9.4 The Licensor does not either expressly or tacitly warrant that the
Software does not infringe any third party intellectual property right
relating to a patent, software or any other property right. Therefore,
the Licensor disclaims any and all liability towards the Licensee
arising out of any or all proceedings for infringement that may be
instituted in respect of the use, modification and redistribution of the
Software. Nevertheless, should such proceedings be instituted against
the Licensee, the Licensor shall provide it with technical and legal
expertise for its defense. Such technical and legal expertise shall be
decided on a case-by-case basis between the relevant Licensor and the
Licensee pursuant to a memorandum of understanding. The Licensor
disclaims any and all liability as regards the Licensee's use of the
name of the Software. No warranty is given as regards the existence of
prior rights over the name of the Software or as regards the existence
of a trademark.
Article 10 - TERMINATION
10.1 In the event of a breach by the Licensee of its obligations
hereunder, the Licensor may automatically terminate this Agreement
thirty (30) days after notice has been sent to the Licensee and has
remained ineffective.
10.2 A Licensee whose Agreement is terminated shall no longer be
authorized to use, modify or distribute the Software. However, any
licenses that it may have granted prior to termination of the Agreement
shall remain valid subject to their having been granted in compliance
with the terms and conditions hereof.
Article 11 - MISCELLANEOUS
11.1 EXCUSABLE EVENTS
Neither Party shall be liable for any or all delay, or failure to
perform the Agreement, that may be attributable to an event of force
majeure, an act of God or an outside cause, such as defective
functioning or interruptions of the electricity or telecommunications
networks, network paralysis following a virus attack, intervention by
government authorities, natural disasters, water damage, earthquakes,
fire, explosions, strikes and labor unrest, war, etc.
11.2 Any failure by either Party, on one or more occasions, to invoke
one or more of the provisions hereof, shall under no circumstances be
interpreted as being a waiver by the interested Party of its right to
invoke said provision(s) subsequently.
11.3 The Agreement cancels and replaces any or all previous agreements,
whether written or oral, between the Parties and having the same
purpose, and constitutes the entirety of the agreement between said
Parties concerning said purpose. No supplement or modification to the
terms and conditions hereof shall be effective as between the Parties
unless it is made in writing and signed by their duly authorized
representatives.
11.4 In the event that one or more of the provisions hereof were to
conflict with a current or future applicable act or legislative text,
said act or legislative text shall prevail, and the Parties shall make
the necessary amendments so as to comply with said act or legislative
text. All other provisions shall remain effective. Similarly, invalidity
of a provision of the Agreement, for any reason whatsoever, shall not
cause the Agreement as a whole to be invalid.
11.5 LANGUAGE
The Agreement is drafted in both French and English and both versions
are deemed authentic.
Article 12 - NEW VERSIONS OF THE AGREEMENT
12.1 Any person is authorized to duplicate and distribute copies of this
Agreement.
12.2 So as to ensure coherence, the wording of this Agreement is
protected and may only be modified by the authors of the License, who
reserve the right to periodically publish updates or new versions of the
Agreement, each with a separate number. These subsequent versions may
address new issues encountered by Free Software.
12.3 Any Software distributed under a given version of the Agreement may
only be subsequently distributed under the same version of the Agreement
or a subsequent version, subject to the provisions of Article 5.3.4
<#compatibility>.
Article 13 - GOVERNING LAW AND JURISDICTION
13.1 The Agreement is governed by French law. The Parties agree to
endeavor to seek an amicable solution to any disagreements or disputes
that may arise during the performance of the Agreement.
13.2 Failing an amicable solution within two (2) months as from their
occurrence, and unless emergency proceedings are necessary, the
disagreements or disputes shall be referred to the Paris Courts having
jurisdiction, by the more diligent Party.
#!/usr/bin/env bash
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
CMD_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)
CMD_NAME=$(basename "${BASH_SOURCE[0]}")
SCRIPTS_DIR="${CMD_DIR}"/scripts
# Include common definitions
source "${CMD_DIR}/profileview-common"
# Pressing CTRL-C will stop the whole execution of the script
trap ctrl_c INT; function ctrl_c() { exit 5; }
# Definition of functions and global variables specific to this script
PVLIB_FORCE=false
PVLIB_PATH=""
PVLIB_DOMID=""
NTHREADS=2
NJOBS=8
PEXEC_CMD="parallel --halt now,fail=1 -j ${NJOBS}"
if ! command -v parallel >/dev/null 2>&1; then
print_warning "cannot find GNU parallel, all jobs will be run sequentially"
PEXEC_CMD="/usr/bin/env bash --"
fi
function print_usage() {
echo -en "\n USAGE: ${CMD_NAME} -i <input_fasta> -D <domain_id> -d <hhblits_db> -n <lib_name> [options]\n"
echo -en "\n"
echo -en " MANDATORY OPTIONS:\n
-i, --input <name>\tInput file of domain sequences in FASTA format\n
-D, --domain-id <name>\tDomain identifier\n
\t(e.g., Pfam accession number PF03441)\n
-d, --db <name>\thhblits database name\n
\t(multiple databases may be specified with '-d <name1> -d <name2> ...')\n
-n, --lib-name <name>\tName of ProfileView library\n" | column -t -s $'\t'
echo -en "\n"
echo -en " OTHER OPTIONS:\n
--force\tForce the construction of ProfileView's library even if the directory already exists\n
\tpossibly overwriting previously generated files\n
-t, --threads <num>\tNumber of threads for each hhblits job (default:2)\n
-j, --max-jobs <num>\tNumber of parallel jobs (default:8)\n
-h, --help\tPrint this help message\n
-V, --version\tPrint version\n" | column -t -s $'\t'
echo -en "\n"
}
# retrieve provided arguments
opts="i:D:d:n:t:j:hV"
longopts="input:,domain-id:,db:,lib-name:,force,threads:,max-jobs:,help,version"
ARGS=$(getopt -o "${opts}" -l "${longopts}" -n "${CMD_NAME}" -- "${@}")
if [ $? -ne 0 ] || [ $# -eq 0 ]; then # do not change the order of this test!
print_usage
exit 1
fi
eval set -- "${ARGS}"
declare -a DBNAMES
while [ -n "${1}" ]; do
case ${1} in
-i|--input)
shift
INPUT_FASTA=${1}
;;
-D|--domain-id)
shift
PVLIB_DOMID=${1}
;;
-d|--db)
shift
DBNAMES+=("${1}")
;;
-n|--lib-name)
shift
PVLIB_PATH=${1}
;;
--force)
PVLIB_FORCE=true
;;
-t|--threads)
shift
NTHREADS=${1}
;;
-j|--max-jobs)
shift
NJOBS=${1}
;;
-h|--help)
print_usage
exit 0
;;
-V|--version)
print_version
exit 0
;;
--)
shift
break
;;
esac
shift
done
# Input parameters validation
if [ ${#DBNAMES[@]} -eq 0 ]; then
print_error "no hh-suite database provided with the -d|--db parameter"
exit 1
else
for db in "${DBNAMES[@]}"; do
db_dir=$(dirname "$db")
if [ ! -d "${db_dir}" ]; then
print_error "hh-suite database path \"${db_dir}\" cannot be found"
exit 1
fi
done
fi
if [ -z "${INPUT_FASTA}" ] || [ ! -f "${INPUT_FASTA}" ]; then
print_error "-i|--input file is missing or does not exist"
exit 1
fi
if [ -z "${PVLIB_DOMID}" ]; then
print_error "-D|--domain-id parameter is mandatory"
exit 1
fi
if [ -z "${PVLIB_PATH}" ]; then
print_error "-n|--lib-name parameter is mandatory"
exit 1
elif [ "${PVLIB_FORCE}" = false ] && [ -d "${PVLIB_PATH}" ]; then
print_error "it seems that profileview library \"${PVLIB_PATH}\" already exists; delete its directory or use a different name."
exit 1
fi
if ! [[ "${NTHREADS}" =~ ^[0-9]+$ ]] || [ "${NTHREADS}" -lt 1 ] ; then
print_warning "-t|--threads parameter should be a positive integer; a value of 1 will be used instead."
NTHREADS=1
fi
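if ! [[ "${NJOBS}" =~ ^[0-9]+$ ]] || [ "${NJOBS}" -lt 1 ] ; then
print_warning "-j|--max-jobs parameter should be a positive integer; a value of 1 will be used instead."
NJOBS=1
fi
# PEXEC_CMD was built before option parsing: rebuild it so that -j|--max-jobs takes effect
if command -v parallel >/dev/null 2>&1; then
PEXEC_CMD="parallel --halt now,fail=1 -j ${NJOBS}"
fi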
check_cmds "hhblits" "reformat.pl" "hmmbuild" "python3"
# Create ProfileView library directory
PVLIB_NAME=$(basename "${PVLIB_PATH}")
PVLIB_PREFIX="$(dirname "${PVLIB_PATH}")"/"${PVLIB_NAME}"
mkdir -p "${PVLIB_PREFIX}"/{query,a3m,hhm,afa,fas,hmm}
if [ $? -ne 0 ]; then
print_error "cannot create all library directories"
exit 1
fi
PVLIB_DIR=$(abspath "${PVLIB_PREFIX}")
PVLIB_QUERY="${PVLIB_DIR}"/query
# First split input fasta file
print_status "splitting input file: ${INPUT_FASTA}"
awk -v queryDir="${PVLIB_QUERY}" '
/^>/ {
idstr=substr($0,2)
gsub(/[^A-Za-z0-9._-]/,"_",idstr)
split(idstr,ida)
sname=ida[1]
printf(">%s\n",idstr) > (queryDir"/"sname".fa")
next
}
sname!="" {
print > (queryDir"/"sname".fa")
}
' "${INPUT_FASTA}"
# Run hhblits on each sequence of the query directory
#TODO: do not execute hhblits if the model has already been built
print_status "building hhblits models"
HHBLITS_DBS=""; for hhdb in "${DBNAMES[@]}"; do HHBLITS_DBS="${HHBLITS_DBS} -d ${hhdb}"; done
for query in "${PVLIB_QUERY}"/*.fa; do
[ -e "${query}" ] || continue
queryBase=${query##*/}
queryName=${queryBase%.fa}
echo "hhblits ${HHBLITS_DBS} -i ${query} -o stdout -oa3m ${PVLIB_PREFIX}/a3m/${queryName}.a3m -ohhm ${PVLIB_PREFIX}/hhm/${queryName}.hhm -M first -qid 60 -cov 70 -id 98 -e 1e-10 -n 3 -cpu ${NTHREADS} -v 0 >/dev/null 2>&1"
done | ${PEXEC_CMD} # leave unquoted: the command string must undergo word splitting
print_status "formatting a3m files"
for a3m in "${PVLIB_DIR}"/a3m/*.a3m; do
[ -e "${a3m}" ] || continue
a3mBase="${a3m##*/}"
a3mName="${a3mBase%.a3m}"
afaPath="${PVLIB_DIR}/afa/${a3mName}.afa"
fasPath="${PVLIB_DIR}/fas/${a3mName}.fas"
echo "reformat.pl a3m fas ${a3m} ${afaPath} -uc -l 80 -v 0 >/dev/null 2>&1 && reformat.pl a3m fas ${a3m} ${fasPath} -M first -r -uc -l 80 -v 0 >/dev/null 2>&1"
done | ${PEXEC_CMD} # leave unquoted: the command string must undergo word splitting
print_status "building HMMER models"
for afa in "${PVLIB_DIR}"/afa/*.afa; do
[ -e "${afa}" ] || continue
afaBase="${afa##*/}"
afaName="${afaBase%.afa}"
hmm="${PVLIB_DIR}/hmm/${afaName}.hmm"
echo "hmmbuild ${hmm} ${afa} >/dev/null 2>&1"
done | ${PEXEC_CMD} # leave unquoted: the command string must undergo word splitting
print_status "generating ProfileView library data"
for hmm in "${PVLIB_DIR}"/hmm/*.hmm; do
[ -e "${hmm}" ] || continue
awk -v domId="${PVLIB_DOMID}" '$1~/^NAME$/{hmm_name=$2;next} $1~/^LENG$/{hmm_len=$2;next} $1~/^NSEQ$/{hmm_nseq=$2;next} $1~/^HMM$/{ OFS=","; print hmm_name,domId,hmm_nseq,hmm_len }' "${hmm}"
done > "${PVLIB_PREFIX}"/"${PVLIB_NAME}".models.list
python3 "${SCRIPTS_DIR}"/createHHdict.py --hhm-dir "${PVLIB_DIR}/hhm" --prefix "${PVLIB_DIR}/${PVLIB_NAME}.hhdict" >/dev/null 2>&1
python3 "${SCRIPTS_DIR}"/createHmmerDict.py --hmm-dir "${PVLIB_DIR}/hmm" --prefix "${PVLIB_DIR}/${PVLIB_NAME}.hmmdict" >/dev/null 2>&1
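The final step of the script above writes one comma-separated line per HMMER model into the `.models.list` file. The sketch below isolates that extraction so it can be tried on a small input. The function name `hmm_to_models_line` and the demo header are hypothetical and are not part of ProfileView; the demo file contains only the header fields the awk program reads, not a complete HMMER model.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ProfileView): print one models.list line,
# "name,domain_id,nseq,length", from the header of a HMMER3 model file.
hmm_to_models_line() {
    local dom_id="$1" hmm_file="$2"
    awk -v domId="${dom_id}" '
        $1 == "NAME" { hmm_name = $2; next }
        $1 == "LENG" { hmm_len  = $2; next }
        $1 == "NSEQ" { hmm_nseq = $2; next }
        # The "HMM" line opens the model body: all header fields are known by now
        $1 == "HMM"  { OFS = ","; print hmm_name, domId, hmm_nseq, hmm_len; exit }
    ' "${hmm_file}"
}

# Made-up header holding only the fields read above
demo_hmm=$(mktemp)
printf '%s\n' 'HMMER3/f [demo]' 'NAME  seq1' 'LENG  120' 'NSEQ  35' 'HMM  A  C  D' > "${demo_hmm}"
hmm_to_models_line "PF03441" "${demo_hmm}"   # prints: seq1,PF03441,35,120
rm -f "${demo_hmm}"
```

Unlike the one-liner in the script, this version stops reading at the `HMM` line, which should not change the result since the header fields precede it.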
#!/usr/bin/env bash
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
# Common functions and definitions to be used in ProfileView
PV_NAME="ProfileView"
PV_VERSION='1.0'
PV_DATE='190531'
function abspath() { echo "$(cd "$(dirname "$1")" && pwd -P)/$(basename "$1")"; }
function print_error() { echo "[error] ${*}" >&2; }
function print_warning() { echo "[warning] ${*}" >&2; }
function print_status() { echo "[main] ${*}" >&2; }
function print_version() { echo "${PV_NAME} ${PV_VERSION}-${PV_DATE}"; }
function check_cmds() {
cmds=("$@")
declare -a notfound
for cmd in "${cmds[@]}"; do
if ! command -v "${cmd}" >/dev/null 2>&1; then
notfound+=("${cmd}")
fi
done
if [ ${#notfound[@]} -gt 0 ]; then
print_error "cannot find the following commands: ${notfound[*]}"
print_error "check your PATH environment variable"
exit 1
fi
return 0
}
function check_pymodules(){
check_cmds "python3"
mods=("$@")
declare -a notfound
for mod in "${mods[@]}"; do
if ! python3 -c "import ${mod}" >/dev/null 2>&1; then
notfound+=("${mod}")
fi
done
if [ ${#notfound[@]} -gt 0 ]; then
print_error "cannot find the following python3 modules: ${notfound[*]}"
exit 1
fi
return 0
}
function check_files() {
files=("$@")
declare -a notfound
for f in "${files[@]}"; do
if [ ! -e "${f}" ]; then notfound+=("${f}"); fi
done
if [ ${#notfound[@]} -gt 0 ]; then
print_error "cannot find the following files:"
for f in "${notfound[@]}"; do echo "$f" >&2; done
exit 1
fi
return 0
}
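The `check_*` helpers above all collect the missing items in an array and exit with an error if the array is not empty. A caller that prefers to decide for itself what to do could use a non-exiting variant along the following lines. This is only a sketch: `missing_cmds` is a hypothetical name and is not defined anywhere in ProfileView.

```bash
#!/usr/bin/env bash
# Hypothetical variant of check_cmds (not part of ProfileView): print the
# commands that cannot be found, one per line, and return non-zero if any
# of them is missing, instead of terminating the calling script.
missing_cmds() {
    local cmd status=0
    for cmd in "$@"; do
        if ! command -v "${cmd}" >/dev/null 2>&1; then
            echo "${cmd}"
            status=1
        fi
    done
    return "${status}"
}

# Usage: warn about missing tools but let the caller choose how to proceed
if ! missing=$(missing_cmds hhblits hmmbuild python3); then
    echo "[warning] missing commands: ${missing//$'\n'/ }" >&2
fi
```

Returning a status rather than calling `exit` keeps the helper usable from interactive shells and from scripts that can fall back to an alternative tool.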
#!/usr/bin/env bash
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
CMD_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)
CMD_NAME=$(basename "${BASH_SOURCE[0]}")
SCRIPTS_DIR="${CMD_DIR}"/scripts
# Include common definitions
source "${CMD_DIR}/profileview-common"
# Pressing CTRL-C will stop the whole execution of the script
trap ctrl_c INT;
function ctrl_c() { exit 5; }
# Definition of functions and global variables specific to this script
MODLIST=""
PV_LIBPATH=""
PV_OUTTREE=""
PV_SEQDESC=""
PV_OUTDIR="out_$(date +%Y%m%d_%H%M%S)"
PV_TMPDIR=""
NJOBS=8
PEXEC_CMD="parallel --halt now,fail=1 -j ${NJOBS}"
if ! command -v parallel >/dev/null 2>&1; then
print_warning "cannot find GNU parallel, all jobs will be run sequentially"
PEXEC_CMD="/usr/bin/env bash --"
fi
function print_usage() {
echo -en "\n USAGE: ${CMD_NAME} -l <profileview_lib> -m <model-list> [options]\n"
echo -en "\n"
echo -en " MANDATORY OPTIONS:\n
-l, --lib <name>\tProfileView library name\n
-m, --models <name>\tFile containing a list of models (one identifier per line)\n
" | column -t -s $'\t'
echo -en "\n"
echo -en " OTHER OPTIONS:\n
-o, --out-dir <name>\tPrefix of output files (default: out_<current-date>)\n
--temp-dir <name>\tTemporary result directory\n
-j, --max-jobs <num>\tNumber of parallel jobs (default:8)\n
-h, --help\tPrint this help message and exit\n
-V, --version\tPrint version and exit\n
" | column -t -s $'\t'
echo -en "\n"
}
# retrieve provided arguments
opts="l:m:o:j:hV"
longopts="lib:,models:,out-dir:,temp-dir:,max-jobs:,help,version"
ARGS=$(getopt -o "${opts}" -l "${longopts}" -n "${CMD_NAME}" -- "${@}")
if [ $? -ne 0 ] || [ $# -eq 0 ]; then # the order of these tests is important!
print_usage
exit 2
fi
eval set -- "${ARGS}"
declare -a DBNAMES
while [ -n "${1}" ]; do
case ${1} in
-l|--lib)
shift
PV_LIBPATH=${1}
;;
-m|--models)
shift
MODLIST=${1}
;;
-o|--out-dir)
shift
PV_OUTDIR=${1}
;;
--temp-dir)
shift
PV_TMPDIR=${1}
;;
-j|--max-jobs)
shift
NJOBS=${1}
;;
-h|--help)
print_usage
exit 0
;;
-V|--version)
print_version
exit 0
;;
--)
shift
break
;;
esac
shift
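done
# Rebuild the parallel command so that -j|--max-jobs takes effect
if command -v parallel >/dev/null 2>&1; then
PEXEC_CMD="parallel --halt now,fail=1 -j ${NJOBS}"
fi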
# Pre-requisites check
check_cmds "hhalign" "python3"
check_pymodules "Bio" "weblogolib" "BitVector"
# Input parameters validation
if [ -z "${PV_LIBPATH}" ]; then
print_error "-l|--lib parameter is mandatory"
print_usage
exit 1
elif [ ! -d "${PV_LIBPATH}" ]; then
print_error "ProfileView library not found: ${PV_LIBPATH}"
exit 1
fi
PV_LIBNAME=$(basename "${PV_LIBPATH}")
PV_LIBDIR=$(abspath "${PV_LIBPATH}")
if [ -z "${MODLIST}" ]; then
print_error "-m|--models parameter is mandatory"
print_usage
exit 1
elif [ ! -f "${MODLIST}" ]; then
print_error "cannot find model list file: ${MODLIST}"
exit 1
fi
# Create temp working directory
if [ -z "${PV_TMPDIR}" ]; then
PV_TMPDIR=$(mktemp -p "${TMPDIR:-.}" -d pvtmp-XXXXX) || {
print_error "cannot create temporary directory"
exit 1
}
fi
PV_TMPDIR=$(abspath "${PV_TMPDIR}")
if [ ! -d "${PV_TMPDIR}" ] || [ -n "$(ls -A "${PV_TMPDIR}")" ]; then
print_error "provided path \"${PV_TMPDIR}\" must be an empty directory"
exit 1
fi
print_status "using temporary directory: ${PV_TMPDIR}"
# Create log file
PV_LOGFILE="${PV_TMPDIR}/${CMD_NAME}_$(date +%Y%m%d_%H%M%S).log"
touch "${PV_LOGFILE}" && print_status "a log file is saved in ${PV_LOGFILE}"
# define model paths
declare -a modlst
while read -r line; do
if [ -z "${line}" ]; then continue; fi # skip empty lines
mod_path="${PV_LIBDIR}/hhm/${line}.hhm"
modlst+=("${mod_path}")
done <<< "$(awk -F, '!/^#/ && NF>0{printf("%s\n",$1)}' "${MODLIST}")"
###################
##### HHALIGN #####
###################
print_status "running hhalign jobs for pairwise comparison of models"
HHALIGN_HHRDIR="${PV_TMPDIR}/hhr"
for query in "${modlst[@]}"; do
[ -e "${query}" ] || continue
queryBase=${query##*/}
queryName=${queryBase%.hhm}
mkdir -p "${HHALIGN_HHRDIR}/${queryName}"
for target in "${modlst[@]}"; do
[ -e "${target}" ] || continue
targetBase=${target##*/}
targetName=${targetBase%.hhm}
[ "${queryName}" = "${targetName}" ] || echo "hhalign -i ${query} -t ${target} -o ${HHALIGN_HHRDIR}/${queryName}/${targetName}.hhr -v 0"
done
done | ${PEXEC_CMD} >>"${PV_LOGFILE}" 2>>"${PV_LOGFILE}"
if [ $? -ne 0 ]; then
print_error "error during hhalign jobs, see log: ${PV_LOGFILE}"
exit 1
fi
# merge hhalign results
print_status "merging hhalign job results"
HHALIGN_RESDIR="${PV_TMPDIR}/hhalign"
mkdir -p "${HHALIGN_RESDIR}"
for query in "${modlst[@]}"; do
[ -e "${query}" ] || continue
queryBase=${query##*/}
queryName=${queryBase%.hhm}
#mkdir -p "${HHALIGN_HHRDIR}/${queryName}"
for target in "${modlst[@]}"; do
[ -e "${target}" ] || continue
targetBase=${target##*/}
targetName=${targetBase%.hhm}
[ "${queryName}" = "${targetName}" ] || cat "${HHALIGN_HHRDIR}/${queryName}/${targetName}.hhr"
done >"${HHALIGN_RESDIR}/${queryName}.hhr"
done
print_status "generating motifs"
HHDICT="${PV_LIBDIR}/${PV_LIBNAME}.hhdict.pgz"
mkdir -p "${PV_OUTDIR}"
if [ $? -ne 0 ]; then
print_error "cannot create output directory"
exit 1
fi
python3 "${SCRIPTS_DIR}/parseHHR.py" --hhm-list "${MODLIST}" --hhm-dict "${HHDICT}" --hhr-dir "${HHALIGN_RESDIR}" --fas-dir "${PV_LIBDIR}/fas" --out-dir "${PV_OUTDIR}" 2>>"${PV_LOGFILE}"
#!/usr/bin/env bash
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
CMD_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)
CMD_NAME=$(basename "${BASH_SOURCE[0]}")
SCRIPTS_DIR="${CMD_DIR}"/scripts
# Include common definitions
source "${CMD_DIR}/profileview-common"
# Pressing CTRL-C will stop the whole execution of the script
trap ctrl_c INT;
function ctrl_c() { exit 5; }
# Definition of functions and global variables specific to this script
INPUT_FASTA=""
PV_LIBPATH=""
PV_OUTTREE=""
PV_SEQDESC=""
PV_OUTPREFIX="out"
PV_TMPDIR=""
NJOBS=8
PEXEC_CMD="parallel --halt now,fail=1 -j ${NJOBS}"
if ! command -v parallel >/dev/null 2>&1; then
print_warning "cannot find GNU parallel, all jobs will be run sequentially"
PEXEC_CMD="/usr/bin/env bash --"
fi
function print_usage() {
echo -en "\n USAGE: ${CMD_NAME} -i <input_fasta> -l <profileview_lib> [options]\n"
echo -en "\n"
echo -en " MANDATORY OPTIONS:\n
-i, --input <name>\tFile containing the sequences to be classified in FASTA format\n
-l, --lib <name>\tProfileView library name\n" | column -t -s $'\t'
echo -en "\n"
echo -en " OTHER OPTIONS:\n
-o, --out-tree <name>\tOutput file which will contain the ProfileView tree\n
\t(if not provided, it will be printed on the standard output)\n
-s, --seq-desc <name>\tInput sequence descriptor file, that is, a CSV file containing the following fields:\n
\t<sequence_id>,<function_id>,<family_id>,<sequence_length>\n
-p, --prefix <name>\tPrefix of output files (default:${PV_OUTPREFIX})\n
--temp-dir <name>\tTemporary result directory\n
-j, --max-jobs <num>\tNumber of parallel jobs (default:8)\n
-h, --help\tPrint this help message\n
-V, --version\tPrint version\n" | column -t -s $'\t'
echo -en "\n"
}
# retrieve provided arguments
opts="i:l:o:s:p:j:hV"
longopts="input:,lib:,out-tree:,seq-desc:,prefix:,temp-dir:,max-jobs:,help,version"
ARGS=$(getopt -o "${opts}" -l "${longopts}" -n "${CMD_NAME}" -- "${@}")
if [ $? -ne 0 ] || [ $# -eq 0 ]; then # the order of these tests is important!
print_usage
exit 2
fi
eval set -- "${ARGS}"
declare -a DBNAMES
while [ -n "${1}" ]; do
case ${1} in
-i|--input)
shift
INPUT_FASTA=${1}
;;
-l|--lib)
shift
PV_LIBPATH=${1}
;;
-o|--out-tree)
shift
PV_OUTTREE=${1}
;;
-s|--seq-desc)
shift
PV_SEQDESC=${1}
;;
-p|--prefix)
shift
PV_OUTPREFIX=${1}
;;
--temp-dir)
shift
PV_TMPDIR=${1}
;;
-j|--max-jobs)
shift
NJOBS=${1}
;;
-h|--help)
print_usage
exit 0
;;
-V|--version)
print_version
exit 0
;;
--)
shift
break
;;
esac
shift
done
# Input parameters validation
if [ -z "${INPUT_FASTA}" ]; then
print_error "-i|--input parameter is mandatory"
print_usage
exit 1
elif [ ! -f "${INPUT_FASTA}" ]; then
print_error "Input file does not exist: ${INPUT_FASTA}"
exit 1
fi
if [ -z "${PV_LIBPATH}" ]; then
print_error "-l|--lib parameter is mandatory"
print_usage
exit 1
elif [ ! -d "${PV_LIBPATH}" ]; then
print_error "ProfileView library not found: ${PV_LIBPATH}"
exit 1
fi
PV_LIBNAME=$(basename "${PV_LIBPATH}")
PV_LIBDIR=$(abspath "${PV_LIBPATH}")
if [ ! -z "${PV_SEQDESC}" ] && [ ! -f "${PV_SEQDESC}" ]; then
print_error "Cannot find sequence descriptor file: ${PV_SEQDESC}"
exit 1
fi
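if ! [[ "${NJOBS}" =~ ^[0-9]+$ ]] || [ "${NJOBS}" -lt 1 ] ; then
print_warning "-j|--max-jobs parameter should be a positive integer; a value of 1 will be used."
NJOBS=1
fi
# Rebuild the parallel command so that -j|--max-jobs takes effect
if command -v parallel >/dev/null 2>&1; then
PEXEC_CMD="parallel --halt now,fail=1 -j ${NJOBS}"
fi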
check_cmds "hmmsearch" "python3" "Rscript"
check_pymodules "ete3" "numpy"
#check_files "${SCRIPTS_DIR}"/{createHHdict.py,createHmmerDict.py,hh_utils.py,pv_utils.py}
# Create temp working directory
if [ -z "${PV_TMPDIR}" ]; then
PV_TMPDIR=$(mktemp -p "${TMPDIR:-.}" -d pvtmp-XXXXX) || {
print_error "cannot create temporary directory"
exit 1
}
fi
PV_TMPDIR=$(abspath "${PV_TMPDIR}")
if [ ! -d "${PV_TMPDIR}" ] || [ -n "$(ls -A "${PV_TMPDIR}")" ]; then
print_error "provided path \"${PV_TMPDIR}\" must be an empty directory"
exit 1
fi
print_status "using temporary directory: ${PV_TMPDIR}"
# Create log file
PV_LOGFILE="${PV_TMPDIR}/profileview-tree_$(date +%Y%m%d_%H%M%S).log"
touch "${PV_LOGFILE}" && print_status "a log file is saved in ${PV_LOGFILE}"
# Possibly create a sequence descriptor file (if not provided)
if [ -z "${PV_SEQDESC}" ]; then
PV_SEQDESC="${PV_TMPDIR}/sequences.csv"
print_status "creating temporary sequence descriptor file: ${PV_SEQDESC}"
awk '
BEGIN {
sname=""
seqlen=0
OFS=","
}
/^>/ {
if(sname!=""){ print sname,"NA","NA",seqlen }
sname=substr($1,2)
seqlen=0
next
}
{
for(i=1;i<=NF;i++){seqlen+=length($i)}
}
END {
if(sname!=""){ print sname,"NA","NA",seqlen }
}
' "${INPUT_FASTA}" >"${PV_SEQDESC}"
else
cp "${PV_SEQDESC}" "${PV_TMPDIR}/sequences.csv"
PV_SEQDESC="${PV_TMPDIR}/sequences.csv"
fi
# Run hmmsearch for each model of the library against the input sequences
print_status "running hmmsearch against input sequences"
HMMSEARCH_RESDIR="${PV_TMPDIR}"/hmmsearch
mkdir -p "${HMMSEARCH_RESDIR}"
for hmm in "${PV_LIBDIR}"/hmm/*.hmm; do
[ -e "${hmm}" ] || continue
hmmBase=${hmm##*/}
hmmName=${hmmBase%.hmm}
echo "hmmsearch -o ${HMMSEARCH_RESDIR}/${hmmName}.out ${hmm} ${INPUT_FASTA} 2>>${PV_LOGFILE}"
done | ${PEXEC_CMD}
if [ $? -ne 0 ]; then
print_error "error during hmmsearch jobs, see log: ${PV_LOGFILE}"
exit 1
fi
print_status "processing hmmsearch output files"
PV_HMMLIB="${PV_LIBDIR}"/"${PV_LIBNAME}".hmmdict.pgz
PV_SCOREFILE="${PV_TMPDIR}"/hmmsearch.scores
for f in "${HMMSEARCH_RESDIR}"/*.out; do
[ -e "${f}" ] || continue
cat "${f}"
done | python3 "${SCRIPTS_DIR}"/parseHMMER.py --hmmer-dict "${PV_HMMLIB}" --seq-db "${PV_SEQDESC}" >"${PV_SCOREFILE}" 2>>"${PV_LOGFILE}"
if [ $? -ne 0 ]; then
print_error "error during processing of hmmsearch results, see log: ${PV_LOGFILE}"
exit 1
fi
print_status "filtering sequences"
PV_SEQDESC="${PV_TMPDIR}/sequences.filtered.csv"
awk -F'\t' '/^#/{next} !x[$3]++{OFS=",";print $3,$6,$5,$4}' "${PV_SCOREFILE}" >"${PV_SEQDESC}" 2>/dev/null
print_status "building representation space"
python3 "${SCRIPTS_DIR}"/generateFeatures.py --seq-list "${PV_SEQDESC}" --hmm-list "${PV_LIBDIR}/${PV_LIBNAME}.models.list" --scores "${PV_SCOREFILE}" --prefix "${PV_TMPDIR}"/out -n 20 -k 3 2>>"${PV_LOGFILE}"
if [ $? -ne 0 ]; then
print_error "error during feature generation, see log: ${PV_LOGFILE}"
exit 1
fi
print_status "building ProfileView tree"
Rscript --vanilla "${SCRIPTS_DIR}"/clusterSequences.R "${PV_TMPDIR}/out.feat" "${PV_TMPDIR}/out.tree" 2>>"${PV_LOGFILE}"
if [ $? -ne 0 ]; then
print_error "could not create ProfileView tree, see log: ${PV_LOGFILE}"
exit 1
fi
print_status "finding representative models and generating annotated ProfileView tree"
python3 "${SCRIPTS_DIR}/findReprModels.py" -t "${PV_TMPDIR}/out.tree" -s "${PV_SCOREFILE}" -m "${PV_TMPDIR}/out.used_models.list" -o "${PV_OUTPREFIX}" 2>>"${PV_LOGFILE}"
if [ $? -ne 0 ]; then
print_error "error during identification of representative models and the construction of the annotated ProfileView tree, see log: ${PV_LOGFILE}"
exit 1
fi
#!/usr/bin/env Rscript
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
library(ape)
args = commandArgs(trailingOnly=TRUE)
if (length(args)==0) {
stop("at least one argument must be supplied")
} else if (length(args)==1) {
args[2] <- ''
}
feat <- read.table( args[1], row.names=1, header=TRUE, sep="\t", na.strings=c("") )
pc <- prcomp(feat[,c(-1,-2)], scale.=TRUE)
eigs <- pc$sdev^2
cumvar = cumsum(eigs)/sum(eigs) # cumulative explained variance of the PCs
pc_i = min( c(length(cumvar),which(cumvar >= .99)) ) # smallest number of PCs explaining at least 99% of the variance
d <- dist(pc$x[,1:pc_i,drop=FALSE])
hc <- hclust(d,method="ward.D2")
my_tree <- as.phylo(hc)
if ( args[2] == '' ) {
write.tree(phy=my_tree,file=stdout())
} else {
write.tree(phy=my_tree,file=args[2])
}
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import sys, os, argparse
import time
import re
import math
import pickle, gzip
from pv_utils import *
from hh_utils import hhm, parse_hhm
script_name = re.sub( r'\.py$', '', os.path.basename(__file__) )
def main(argv=None):
    start = time.perf_counter()
    parser = argparse.ArgumentParser()
    parser.add_argument('--hhm-file', dest='hhmFile', type=str, help='hhm file path')
    parser.add_argument('--hhm-dir', dest='hhmDir', type=str, help='hhm directory path')
    parser.add_argument('--prefix', dest='outPrefix', default="hhdict", type=str, help='output prefix of generated dictionary')
    args = parser.parse_args()
    # parameters check
    validParameters = True
    if args.hhmFile is None and args.hhmDir is None:
        validParameters = False
        print_error('one of the --hhm-file or --hhm-dir arguments must be set')
    if args.hhmFile is not None and args.hhmDir is not None:
        validParameters = False
        print_error('it is not possible to use both --hhm-file and --hhm-dir arguments')
    elif args.hhmFile is not None and not os.path.isfile(args.hhmFile):
        validParameters = False
        print_error('--hhm-file \"{}\" is not a file'.format(args.hhmFile))
    elif args.hhmDir is not None and not os.path.isdir(args.hhmDir):
        validParameters = False
        print_error('--hhm-dir \"{}\" is not a directory'.format(args.hhmDir))
    if not validParameters:
        return 1
    fileList = []
    if args.hhmFile is not None:
        fileList = [ args.hhmFile ]
    if args.hhmDir is not None:
        hhms_dir = os.path.abspath(args.hhmDir)
        fileList = [ '{}/{}'.format(hhms_dir,f) for f in os.listdir(hhms_dir) if f.endswith('.hhm') and os.path.isfile('{}/{}'.format(hhms_dir,f)) ]
        if len(fileList) == 0:
            print_warning('no hhm file found in directory "{}"'.format(args.hhmDir))
    print_status("parsing HHM files")
    hhmDict = {}
    for fname in fileList:
        with open(fname,'r') as fh:
            hhm = parse_hhm(fh)
            hhmDict[ hhm.name ] = hhm
    hhmDictFileName = "{}.pgz".format(args.outPrefix)
    # only the output file name is interpolated in the message
    print_status("writing hhm dictionary to \"{}\"".format(hhmDictFileName))
    with gzip.GzipFile(hhmDictFileName, 'w') as hhmDictFile:
        pickle.dump( hhmDict, hhmDictFile )
    exec_time = time.perf_counter() - start
    print_status('execution time: {:.2f}s'.format(exec_time))
    return 0

# Check if the program is not being imported
if __name__ == "__main__":
    sys.exit(main())
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import sys, os, argparse
import time
import gzip, pickle
from pv_utils import *
from hmm_utils import *
# a HMMER3 model is assumed
def parseHmmerModel( fh ):
    hmm = hmm_model()
    while True:
        line = fh.readline()
        if not line:
            break
        tokens = line.split()
        if len(tokens) == 0:
            continue
        elif tokens[0] == "NAME":
            hmm.name = tokens[1]
        elif tokens[0] == "LENG":
            hmm.length = int(tokens[1])
        elif tokens[0] == "NSEQ":
            hmm.nseq = int(tokens[1])
        elif tokens[0] == "HMM":
            fh.readline() # skip line after HMM: m->m m->i m->d i->m i->i d->m d->d
            stateCount = 1
            line = fh.readline()
            while line and not line.startswith("//"):
                tokens = line.split()
                if len(tokens) == 26 and stateCount == int(tokens[0]):
                    hmm.states.append([float(v) for v in tokens[1:21]])
                    stateCount += 1
                elif len(tokens) == 21 and tokens[0] == "COMPO":
                    hmm.bg_freq = [float(v) for v in tokens[1:21]]
                line = fh.readline()
            if line.startswith("//") and hmm.is_valid():
                hmm.states = [hmm.bg_freq] + hmm.states
                yield hmm
            hmm = hmm_model()

def main(argv=None):
    start = time.perf_counter()
    parser = argparse.ArgumentParser()
    parser.add_argument('--hmm-file', dest='hmmFile', type=str, help='hmm file path')
    parser.add_argument('--hmm-dir', dest='hmmDir', type=str, help='hmm directory path')
    parser.add_argument('--prefix', dest='outPrefix', default="hmmerDict", type=str, help='output prefix of generated dictionary')
    args = parser.parse_args()
    # parameters check
    validParameters = True
    if args.hmmFile is None and args.hmmDir is None:
        validParameters = False
        print_error('one of the --hmm-file or --hmm-dir arguments must be set')
    if args.hmmFile is not None and args.hmmDir is not None:
        validParameters = False
        print_error('it is not possible to use both --hmm-file and --hmm-dir arguments')
    elif args.hmmFile is not None and not os.path.isfile(args.hmmFile):
        validParameters = False
        print_error('--hmm-file \"{}\" is not a file'.format(args.hmmFile))
    elif args.hmmDir is not None and not os.path.isdir(args.hmmDir):
        validParameters = False
        print_error('--hmm-dir \"{}\" is not a directory'.format(args.hmmDir))
    if not validParameters:
        return 1
    # note: time.clock() was removed in Python 3.8; the timer started above with time.perf_counter() is used instead
    fileList = []
    if args.hmmFile is not None:
        fileList = [ args.hmmFile ]
    if args.hmmDir is not None:
        hmmdir = args.hmmDir[:-1] if args.hmmDir.endswith("/") else args.hmmDir
        fileList = [ '{}/{}'.format(hmmdir,f) for f in os.listdir(hmmdir) if f.endswith('.hmm') and os.path.isfile('{}/{}'.format(hmmdir,f)) ]
    print_status("parsing HMMER3 models")
    hmmDict = {}
    for fname in fileList:
        with open(fname,'r') as fh:
            for hmm in parseHmmerModel(fh):
                hmmDict[ hmm.name ] = hmm
                #print(str(hmm))
    hmmDictFileName = "{}.pgz".format(args.outPrefix)
    print_status("writing ProfileView's HMM dictionary to \"{}\"".format(hmmDictFileName))
    with gzip.GzipFile(hmmDictFileName, 'w') as hmmDictFile:
        pickle.dump( hmmDict, hmmDictFile )
    exec_time = time.perf_counter() - start
    print_status('execution time: {:.2f}s'.format(exec_time))
    return 0

# Check if the program is not being imported
if __name__ == "__main__":
    sys.exit(main())
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import sys, errno, os
import argparse
import math
from collections import defaultdict
import numpy as np
from ete3 import Tree
from pv_utils import *
def getBestSubtreeThreshold( mod_name, seq_scores, seq_names, seq_subtree ):
    thr = [0.0,0.0]
    points = [ seq_scores[i] for i,s in enumerate(seq_names) if s not in seq_subtree ]
    return np.amax(points,axis=0) if len(points) > 0 else thr

def getCentroid(points):
    a = np.asarray(points)
    num = a.shape[0]
    return [ np.sum(a[:,0])/num, np.sum(a[:,1])/num ] if num > 0 else [0.0,0.0]

def getRepresentedSequences( mod_name, seq_scores, seq_names, seq_subtree ):
    thr = getBestSubtreeThreshold( mod_name, seq_scores, seq_names, seq_subtree )
    indices = np.nonzero(np.any(np.greater(seq_scores,thr),axis=1))[0]
    scores = [ seq_scores[i] for i in indices ]
    names = uniq( seq_names[i] for i in indices )
    ctr = getCentroid(scores)
    ctr_norm = math.hypot(*ctr) if ctr is not None else 0.0
    return { "model":mod_name, "covered":names, "num_sequences":len(seq_subtree), "covered_score":ctr_norm, "thr":thr }

def getBestRepresentativeModel( best, a ):
    best_covered = len(best["covered"])
    best_score = best["covered_score"]
    a_covered = len(a["covered"])
    a_score = a["covered_score"]
    if (a_covered > best_covered) or (a_covered == best_covered and a_score > best_score):
        best = a
    return best

def hasBetterCoverage(a,b):
    a_covered = len(a["covered"])
    b_covered = len(b["covered"])
    return a_covered > b_covered

def main( argv = None ):
    # GET PARAMETERS
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', '--tree', dest='treeFile', required=True, help='Input tree (in newick format) obtained from the hierarchical clustering')
    parser.add_argument('-s', '--scores', dest='scoreFile', required=True, help='Score file obtained by the script parseHMMER.py')
    parser.add_argument('-m', '--models', dest='modelFile', help='List of model identifiers to consider')
    parser.add_argument('--sorted', dest='isScoreSorted', action='store_true', help='Use this flag if score file is sorted by score (descending order)')
    parser.add_argument('--min-coverage', dest='minCoverage', type=float, default=0.5)
    parser.add_argument('-o', '--out', dest='outFile', default='out.reprTree', help='output prefix')
    opts = parser.parse_args()
    # VALIDATE PARAMETERS
    validParameters = True
    if not os.path.isfile(opts.treeFile):
        validParameters = False
        print_error("input tree file \"" + opts.treeFile + "\" does not exist.")
    if not os.path.isfile(opts.scoreFile):
        validParameters = False
        print_error("score file \"" + opts.scoreFile + "\" does not exist.")
    if not validParameters:
        return 1
    # LOAD MODEL FILE (IF PROVIDED)
    modSet = set()
    if opts.modelFile is not None:
        with open(opts.modelFile,'r') as fh:
            for line in fh:
                modid = line.strip()
                if modid != "":
                    modSet.add(modid)
    # LOAD FUNCTION TREE
    funTree = Tree(opts.treeFile)
    seqSet = set([ n.name for n in funTree.iter_leaves() ])
    # LOAD SCORE FILE
    modSeqs = defaultdict(list)
    modScores = defaultdict(list)
    with open(opts.scoreFile,'r') as fh:
        for line in fh:
            if not line.startswith('#'):
                fields = line.strip().split('\t') # hmm_name, hmm_len, seq_name, seq_len, seq_family, seq_func, bitscore, mean_score, mcscore, mean_mcs, wcscore, mean_wcs, ...
                mod_name = fields[0]
                seq_name = fields[2]
                if opts.modelFile is None: # no model file given, consider the set of models mentioned in the score file
                    modSet.add(mod_name)
                if seq_name in seqSet:
                    modSeqs[mod_name].append(seq_name)
                    modScores[mod_name].append([float(fields[x]) for x in [7,11]]) # [ mean_score (mean bitscore), mean_wcs ]
    seqDict = defaultdict(set)
    for mod_name in modScores: # in modDict
        if mod_name in modSet: # and (len(modDict[mod_name]) > 0):
            for i in np.argmax(modScores[mod_name],axis=0):
                seq_name = modSeqs[mod_name][i]
                seqDict[seq_name].add(mod_name)
    # (POST-ORDER) VISIT OF THE FUNCTIONAL TREE TO FIND REPRESENTATIVE MODELS
    nameIdx=1
    for node in funTree.iter_descendants("postorder"):
        if node.is_leaf(): # do not find best representative for leaves
            node.add_feature("models",seqDict[node.name])
            node.add_feature("sequences",set([node.name]))
            node.add_feature("num_covered",0)
            node.add_feature("mod_support",0.0)
            #print( 'LEAF({}): {}'.format(node.name,len(node.models)) )
            continue
        # internal node (different from the root)
        assert len(node.children) == 2
        node.name = 'I{}'.format(nameIdx)
        nameIdx += 1
        # gather models and sequences from the children
        node.add_feature("models", node.children[0].models | node.children[1].models)
        node.add_feature("sequences", node.children[0].sequences | node.children[1].sequences)
        #print( 'INTERNAL({}): {}'.format(node.name,node.models) )
        # test whether any of the models is representative for the sequence set
        best = { "model":"", "covered":[], "num_sequences":len(node.sequences), "covered_score":0.0, "thr":[0.0,0.0] }
        best_list = []
        good_list = []
        for mod_name in node.models:
            rs = getRepresentedSequences(mod_name, modScores[mod_name], modSeqs[mod_name], node.sequences)
            rs_covered = len(rs["covered"])
            rs_support = float(rs_covered)/rs["num_sequences"]
            best_covered = len(best["covered"])
            if rs_covered > best_covered:
                best_list = [ (rs["model"],rs["covered_score"]) ]
            elif rs_covered == best_covered:
                best_list.append( (rs["model"],rs["covered_score"]) )
            if rs_support >= .9:
                good_list.append( (rs["model"],rs["covered_score"]) )
            best = getBestRepresentativeModel(best,rs)
        best_list.sort(key=lambda x:float(x[1]),reverse=True)
        good_list.sort(key=lambda x:float(x[1]),reverse=True)
        node.add_feature("repr_model",best["model"])
        node.add_feature("best_models",best_list)
        node.add_feature("good_models",good_list)
        node.add_feature("covered",best["covered"])
        node.add_feature("num_covered",len(node.covered))
        node.add_feature("num_sequences",len(node.sequences))
        node.add_feature("mod_support",float(node.num_covered)/node.num_sequences)
        node.add_feature("covered_score",best["covered_score"])
        node.add_feature("thr",best["thr"])
        node.support = node.mod_support if (node.mod_support >= opts.minCoverage and node.num_covered != node.children[0].num_covered and node.num_covered != node.children[1].num_covered) else 0.0
        node.support = node.support if len(node.sequences) > 2 or (node.num_covered == 2 and node.num_sequences == 2) else 0.0
        node.add_feature("conf",node.support)
    with open('{}.models.tsv'.format(opts.outFile),'w') as ofh:
        fmt = ['{nodeid}','{repr_model}','{mod_support:.2f}','{num_covered}','{num_sequences}','{seq_name}','{is_covered}','{best_models}','{good_models}']
        for node in funTree.iter_descendants('preorder'):
            if node.is_leaf() or node.num_covered <= 2:
                continue
            if node.support > 0.0: # has a representative model
                covered = set(node.covered)
                best_set = set()
                good_set = set()
                best_list=[]
                for p in node.best_models:
                    if p[0] not in best_set:
                        best_list.append(p[0])
                        best_set.add(p[0])
                good_list=[]
                for p in node.good_models:
                    if p[0] not in best_set and p[0] not in good_set:
                        good_list.append(p[0])
                        good_set.add(p[0])
                for seq_name in node.sequences:
                    ofh.write('\t'.join(fmt).format( nodeid=node.name, repr_model=node.repr_model, mod_support=node.mod_support,
                        num_covered=node.num_covered, num_sequences=node.num_sequences, seq_name=seq_name,
                        is_covered=(seq_name in covered), best_models=';'.join(best_list), good_models=';'.join(good_list) ))
                    ofh.write('\n')
    funTree.write(features=['name','conf','repr_model','num_covered'], format=2, outfile='{}.nhx'.format(opts.outFile))
    return 0

# Check if the program is not being imported
if __name__ == "__main__":
    sys.exit(main())
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import sys, os
import argparse
from operator import itemgetter
from pv_utils import *
def main( argv = None ):
# GET PARAMETERS
parser = argparse.ArgumentParser()
parser.add_argument('--seq-list', dest='seqListFile', required=True, help='CSV file containing, for each line, sequence-name, seq-function, seq-family, seq-length')
parser.add_argument('--hmm-list', dest='hmmListFile', required=True, help='CSV file containing, for each line, hmm-name, pfam-domain, nseq, model-length')
parser.add_argument('--scores', dest='scoreFile', required=True, help='score file obtained with parse_tblout_v2.py from the output obtained with hmmsearch')
parser.add_argument('-n', '--nseq', dest='nseq', type=int, default=20, help='consider only HMMs with a number of sequences greater or equal than this value (default:5)')
parser.add_argument('-k', '--k-best', dest='kBest', type=int, default=1, help='consider only the k best models mapped against each sequence (default:1)')
parser.add_argument('-p', '--prefix', dest='outPrefix', default='out', help='output file prefix')
args = parser.parse_args()
# VALIDATE PARAMETERS
validParameters = True
if not os.path.isfile(args.seqListFile):
validParameters = False
print_error("sequence list file \"{}\" does not exist.".format(args.seqListFile))
if not os.path.isfile(args.hmmListFile):
validParameters = False
print_error("HMM list file \"{}\" does not exist.".format(args.hmmListFile))
if not os.path.isfile(args.scoreFile):
validParameters = False
print_error("score file \"{}\" does not exist.".format(args.scoreFile))
if not validParameters:
return 1
# HANDLE "BAD" PARAMETERS
if args.kBest < 1:
print_warning("-k/--k-best parameter must be a positive integer, default value of 1 will be used")
args.kBest = 1
# LOAD SEQUENCE DICTIONARY
seqDict = {}
with open(args.seqListFile,'r') as seqListFile:
for line in seqListFile:
record = [ x.strip() for x in line.split(',') ]
seq_name, seq_func, seq_fam = record[0:3]
seq_len = int(record[3])
seqDict[seq_name] = { 'Function':seq_func, 'Family':seq_fam, 'Length':seq_len }
# LOAD HMMER MODEL DICTIONARY
hmmDict = {}
with open(args.hmmListFile,'r') as hmmListFile:
for line in hmmListFile:
record = [ x.strip() for x in line.split(',') ]
hmm_name, pfam_acc = record[0:2]
hmm_nseq = int(record[2])
hmm_len = int(record[3])
if int(hmm_nseq) >= args.nseq:
hmmDict[hmm_name] = { 'Domain':pfam_acc, 'NSEQ':hmm_nseq, 'Length':hmm_len }
print_status("sequences: {}".format(len(seqDict)))
print_status("models with >= {} sequences: {}".format(args.nseq,len(hmmDict)))
allHits = []
strictHits = []
hmmStats = {}
seqStats = {}
with open(args.scoreFile,'r') as scoreFile:
for line in scoreFile:
line = line.strip()
if (not line) or line.startswith('#'):
continue
# the score file is formatted with the following (tab-separated) fields:
# hmm_name hmm_len seq_name seq_len seq_family seq_func bitscore mean_score mcscore mean_mcs wcscore mean_wcs ident hmm_cov hit_type
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
fields = [ x.strip() for x in line.split('\t') ]
hmm_name = fields[0]
hmm_len = int(fields[1])
seq_name = fields[2]
seq_len = int(fields[3])
bitscore = float(fields[6])
mean_bs = float(fields[7])
mcscore = float(fields[8])
mean_mcs = float(fields[9])
wcscore = float(fields[10])
mean_wcs = float(fields[11])
identity = float(fields[12])
hmm_cov = float(fields[13])
hit_type = fields[14]
if hmm_name in hmmDict:
if hmm_name not in hmmStats:
hmmStats[hmm_name] = { 'FULL':0, 'OVER':0, 'PART':0 }
if seq_name not in seqStats:
seqStats[seq_name] = { 'FULL':0, 'OVER':0, 'PART':0, 'MODELS':[] }
hit = [ hmm_name, seq_name, bitscore, mean_bs, mcscore, mean_mcs, wcscore, mean_wcs, identity, hmm_cov, hit_type ]
allHits.append(hit)
seqStats[seq_name]['MODELS'].append( (hmm_name,mean_bs,mean_wcs,hmm_len,hmm_cov,hit_type) )
if (hit_type == 'FULL' or
(hit_type == 'PART' and hmm_cov >= 75.0) or
(hit_type == 'OVER' and hmm_cov >= 30.0)):
strictHits.append(hit)
if hit_type == 'FULL':
hmmStats[hmm_name]['FULL'] += 1
seqStats[seq_name]['FULL'] += 1
elif hit_type == 'PART':
hmmStats[hmm_name]['PART'] += 1
seqStats[seq_name]['PART'] += 1
elif hit_type == 'OVER':
hmmStats[hmm_name]['OVER'] += 1
seqStats[seq_name]['OVER'] += 1
print_status("{} hits loaded ({} strict)".format(len(allHits),len(strictHits)))
# Discard partial/incomplete sequences
with open("{}.discarded".format(args.outPrefix),'w') as outFile:
partialSequences = []
for seq_name in seqStats:
stats = seqStats[seq_name]
tot = stats['FULL'] + stats['PART'] + stats['OVER']
full = float(stats['FULL'])/tot if tot > 0 else 0.0
part = float(stats['PART'])/tot if tot > 0 else 0.0
over = float(stats['OVER'])/tot if tot > 0 else 0.0
if (full < .5 and part < .5) or over >= .3: # significant number of sequence-model "overlapping" hits (i.e., the sequence is likely to be incomplete)
partialSequences.append(seq_name)
outFile.write("{}\t{:.4f}\t{:.4f}\t{:.4f}\n".format(seq_name,full,over,part))
for seq_name in partialSequences:
del seqStats[seq_name]
del seqDict[seq_name]
print_status("{} putative incomplete sequences, {} remaining".format(len(partialSequences),len(seqDict)))
kBest1 = set()
kBest2 = set()
with open("{}.seq.stats".format(args.outPrefix),'w') as ofh:
for sname in seqStats:
seqStats[sname]['MODELS'].sort(key=itemgetter(2),reverse=True) # sort by mean-hcscore
for i in range(min(args.kBest,len(seqStats[sname]['MODELS']))):
kBest1.add(seqStats[sname]['MODELS'][i][0])
seqStats[sname]['MODELS'].sort(key=itemgetter(1),reverse=True) # sort by mean-bitscore
for i in range(min(args.kBest,len(seqStats[sname]['MODELS']))):
kBest2.add(seqStats[sname]['MODELS'][i][0])
#kBestModelSet.add(seqStats[sname]['MODELS'][i][0])
cols = [ sname, seqDict[sname]['Family'], seqDict[sname]['Function'] ] + [ str(x) for x in seqStats[sname]['MODELS'][i] ]
ofh.write( '\t'.join(cols) + '\n' )
kBestModelSet = kBest2 | kBest1
print_status("kBest models: {}".format(len(kBestModelSet)))
with open('{}.used_models.list'.format(args.outPrefix), 'w') as f:
for m in kBestModelSet:
f.write('{}\n'.format(m))
# OUTPUT FEATURE MATRIX BASED ON ALL HITS
bestMeanBSDict = {}
bestMeanHCSDict = {}
for hit in allHits:
key = tuple(hit[0:2]) # (hmm_name,seq_name)
bitscore = float(hit[2])
wcscore = float(hit[6])
mcscore = float(hit[4])
mean_bs, mean_cms, mean_wcs = [ float(hit[x]) for x in [3,5,7] ] # mean_bit_score mean_consensus-match_score mean_highly-conserved-position_score
if key[0] not in kBestModelSet:
continue
if (key not in bestMeanBSDict or mean_bs > bestMeanBSDict[key]):
bestMeanBSDict[key] = mean_bs
if (key not in bestMeanHCSDict or mean_wcs > bestMeanHCSDict[key]):
bestMeanHCSDict[key] = mean_wcs # mean highly-conserved-position score
print_status("kBest hits: {}".format(len(bestMeanBSDict)))
with open("{}.feat".format(args.outPrefix),'w') as outFile:
outFile.write( '\t'.join(['SeqName','Label','Family'] +
[ name.split('_')[0]+'_MBS' for name in kBestModelSet ] +
[ name.split('_')[0]+'_MHCS' for name in kBestModelSet ]) + '\n' )
for seq_name in seqDict:
scores = [ "{:.4f}".format(bestMeanBSDict.get((hmm_name,seq_name),0.0)) for hmm_name in kBestModelSet ] + \
[ "{:.4f}".format(bestMeanHCSDict.get((hmm_name,seq_name),0.0)) for hmm_name in kBestModelSet ]
out_row = [ seq_name, seqDict[seq_name]['Function'], seqDict[seq_name]['Family'] ] + scores
outFile.write( '\t'.join(out_row) + '\n' )
# OUTPUT FEATURE MATRIX BASED ONLY ON STRICT HITS
# bestMeanBSDict = {}
# bestMeanHCSDict = {}
# for hit in strictHits:
# key = tuple(hit[0:2]) # (hmm_name,seq_name)
# mean_bs, mean_cms, mean_hcs = [ float(x) for x in hit[3:8:2] ]
# if key[0] not in kBestModelSet:
# continue
# if (key not in bestMeanBSDict or mean_bs > bestMeanBSDict[key]):
# bestMeanBSDict[key] = mean_bs
# bestMeanHCSDict[key] = mean_mcs
# bestMeanWSDict[key] = mean_hcs
# print_status("kBest (strict) hits: {}".format(len(bestMeanBSDict)))
#
# with open("{}.strict.feat".format(args.outPrefix),'w') as outFile:
# outFile.write( '\t'.join(['SeqName','Label','Family'] +
# [ name.split('_')[0]+'_MBS' for name in kBestModelSet ] +
# [ name.split('_')[0]+'_MWS' for name in kBestModelSet ] +
# [ name.split('_')[0]+'_MHCS' for name in kBestModelSet ]) + '\n' )
# for seq_name in seqDict:
# scores = [ "{:.4f}".format(bestMeanBSDict.get((hmm_name,seq_name),0.0)) for hmm_name in kBestModelSet ] + \
# [ "{:.4f}".format(bestMeanWSDict.get((hmm_name,seq_name),0.0)) for hmm_name in kBestModelSet ] + \
# [ "{:.4f}".format(bestMeanHCSDict.get((hmm_name,seq_name),0.0)) for hmm_name in kBestModelSet ]
# out_row = [ seq_name, seqDict[seq_name]['Function'], seqDict[seq_name]['Family'] ] + scores
# outFile.write( '\t'.join(out_row) + '\n' )
return 0
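The incomplete-sequence filter applied in main() can be read in isolation. The following is a minimal sketch, assuming the same per-sequence counters ('FULL', 'PART', 'OVER') as seqStats; the function names are hypothetical and do not exist in ProfileView.

```python
def hit_type_fractions(stats):
    """Return the (full, part, over) fractions of a sequence's hits."""
    tot = stats['FULL'] + stats['PART'] + stats['OVER']
    if tot == 0:
        return 0.0, 0.0, 0.0
    return stats['FULL'] / tot, stats['PART'] / tot, stats['OVER'] / tot


def is_putatively_incomplete(stats):
    """Same rule as in main(): neither FULL nor PART hits reach 50%,
    or at least 30% of the hits are of type OVER."""
    full, part, over = hit_type_fractions(stats)
    return (full < 0.5 and part < 0.5) or over >= 0.3


if __name__ == "__main__":
    # Illustrative counters only, not taken from real data.
    for stats in ({'FULL': 8, 'PART': 1, 'OVER': 1},
                  {'FULL': 6, 'PART': 0, 'OVER': 4}):
        print(stats, is_putatively_incomplete(stats))
```

Note that a sequence without any hit is also flagged, because both fractions default to 0.0.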
# Check if the program is not being imported
if __name__ == "__main__":
sys.exit(main())
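The k-best model selection in the script above takes, for every sequence, the top k models by mean_wcs and the top k by mean bit score, then keeps the union. A minimal sketch of that idea follows; it assumes simplified (hmm_name, mean_bs, mean_wcs) tuples rather than the six-field tuples stored in seqStats, and the helper name is hypothetical.

```python
from operator import itemgetter


def k_best_models(models, k):
    """models: list of (hmm_name, mean_bs, mean_wcs) tuples for one sequence.

    Returns the names of the models ranked in the top k by either score.
    """
    by_wcs = sorted(models, key=itemgetter(2), reverse=True)[:k]
    by_bs = sorted(models, key=itemgetter(1), reverse=True)[:k]
    return {m[0] for m in by_wcs} | {m[0] for m in by_bs}


if __name__ == "__main__":
    # Illustrative scores only.
    models = [('A', 10.0, 1.0), ('B', 5.0, 9.0), ('C', 7.0, 3.0)]
    print(sorted(k_best_models(models, 1)))
```

With k=1 the two rankings disagree in this toy input, so two models are retained; the union can therefore hold up to 2k models per sequence.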
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import math
import operator
from BitVector import BitVector
from pv_utils import *
class hhm():
def __init__(self, hhm_name, aa2idx, neff_hmm, neff_m, consensus, master_seq, match_states):
self.name = hhm_name
self.aa2idx = aa2idx
self.consensus = consensus
self.master_seq = master_seq
self.match_states = match_states
self.neff_hmm = neff_hmm
self.neff_m = neff_m
self.conserved = BitVector( bitstring = ''.join([ '1' if c.isupper() else '0' for c in self.consensus ]) )
def __str__(self):
return "NAME: {}\nCONSENSUS: {}\nCONSERVED: {}\nMASTERSEQ: {}".format(self.name, self.consensus, self.conserved, self.master_seq)
def frequency(self, state, aa):
aaidx = self.aa2idx[aa]
val = self.match_states[state][aaidx]
return 2**(-val/1000.0)
def state_frequencies(self, state):
out = {}
for aa in self.aa2idx:
aai = self.aa2idx[aa]
val = self.match_states[state][aai]
out[aa] = 2**(-val/1000.0)
return out
def size(self):
return len(self.master_seq)
def get_consensus(self):
return self.consensus
def parse_hhm(fh):
hhm_name = ""
aa_list = []
aa2idx = {}
consensus = ""
master_seq = ""
match_states = []
neff_hmm = 0.0
neff_m = [99.999]
while True:
line = fh.readline()
if line == '':
break
tokens = line.split()
if len(tokens) == 0:
continue
elif tokens[0] == "NAME":
hhm_name = tokens[1]
elif tokens[0] == "NEFF":
neff_hmm = float(tokens[1])
elif tokens[0] == "SEQ":
while line and not line.startswith('>Consensus'): # stop at EOF instead of looping forever
line = fh.readline()
# found consensus sequence id
line = fh.readline()
while line and line[0] != '>':
consensus += ''.join(line.split())
line = fh.readline()
# found master sequence id
line = fh.readline()
while line and line[0] != '>' and line[0] != '#':
master_seq += ''.join(line.split())
line = fh.readline()
# consensus/master sequences processed, discard remaining sequences until the end of the SEQ block
while line and line[0] != '#':
line = fh.readline()
elif tokens[0] == "NULL":
null_values = [ int(v) if is_int(v) else 99999 for v in tokens[1:21] ]
match_states.append(null_values)
elif tokens[0] == "HMM":
# load AA alphabet
aa_list = tokens[1:21]
for i,aa in enumerate(aa_list):
aa2idx[aa] = i
# skip first two lines (transitions definitions/start probabilities)
fh.readline() # M->M M->I M->D I->M I->I D->M D->D Neff NeffI NeffD
fh.readline() # 0 * * 0 * 0 * * * *
line = fh.readline()
while line and not line.startswith("//"):
tokens = line.split()
if len(tokens) == 23:
state_values = [ int(val) if is_int(val) else 99999 for val in tokens[2:22] ]
match_states.append(state_values)
elif len(tokens) == 10:
val = tokens[7]
neff_m.append( int(val)/1000.0 if is_int(val) else 99.999 )
line = fh.readline()
assert len(match_states) == len(master_seq)+1
return hhm( hhm_name, aa2idx, neff_hmm, neff_m, consensus, master_seq, match_states )
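The frequency() and state_frequencies() methods above decode each stored integer as 2**(-val/1000.0), and parse_hhm() stores 99999 for non-integer tokens such as '*'. A minimal sketch of that encoding is shown below; encode() is a hypothetical inverse added for illustration and is not part of this module.

```python
import math


def decode(val):
    """Stored integer -> probability, as done in hhm.frequency()."""
    return 2 ** (-val / 1000.0)


def encode(p):
    """Hypothetical inverse: probability -> stored integer (p must be > 0)."""
    return int(round(-1000.0 * math.log2(p)))


if __name__ == "__main__":
    for val in (0, 1000, 2000, 99999):
        print(val, decode(val))
```

The sentinel 99999 decodes to a value below 1e-29, so it behaves as an effectively zero frequency.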
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import math
import operator
from pv_utils import *
class hmm_model():
def __init__(self):
self.name = ""
self.length = 0
self.nseq = 0
self.states = []
self.cons = ""
self.aa_list = [ c for c in "ACDEFGHIKLMNPQRSTVWY" ]
self.aa2idx = { c:i for i,c in enumerate(self.aa_list) }
self.bg_freq = [ -math.log(0.05) for c in self.aa_list ]
def __str__(self):
consensus = []
for k,s in enumerate(self.states):
aai, minval = min(enumerate(s), key=operator.itemgetter(1))
f = self.frequency(k, self.aa_list[aai])
c = self.aa_list[aai].upper() if f >= .6 else (self.aa_list[aai].lower() if f >= .4 else 'x')
consensus.append(c)
return "NAME={}\nLENG={} STATES={}\nNSEQ={}\nCONS={}".format(self.name, self.length,len(self.states),self.nseq, ''.join(consensus))
def frequency(self, k, aa):
aaidx = self.aa2idx[aa]
return math.exp(-self.states[k][aaidx])
# def consensus(self):
# if not self.cons:
# consensus = []
# for k,s in enumerate(self.states):
# aai, minval = min(enumerate(s), key=operator.itemgetter(1))
def getStateFreqDict(self, k):
outDict = {}
for aa in self.aa2idx:
aaidx = self.aa2idx[aa]
outDict[aa] = math.exp(-self.states[k][aaidx])
return outDict
def ln_odds_ratio(self,k,aa):
aai = self.aa2idx[aa]
m_k_a = self.states[k][aai]
m_0_a = self.bg_freq[aai]
return (m_0_a - m_k_a)
def score(self, k, aa):
return self.ln_odds_ratio(k,aa)/math.log(2.0) # convert to base2 logarithm
def is_valid(self):
return len(self.name) > 0 and self.length > 0 and self.length == len(self.states) and self.nseq > 0
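The score() method above reduces to log2(p / 0.05), because states hold negative natural logarithms of emission probabilities and bg_freq holds -ln(0.05) for every residue. A minimal stand-alone sketch, with a hypothetical function name:

```python
import math

BG = 0.05  # uniform background frequency, as in hmm_model.bg_freq


def score_bits(neg_ln_p, bg=BG):
    """Log-odds score in bits for an emission stored as -ln(p)."""
    return (-math.log(bg) - neg_ln_p) / math.log(2.0)


if __name__ == "__main__":
    for p in (0.4, 0.05, 0.01):
        print(p, round(score_bits(-math.log(p)), 4))
```

Under this uniform background, a score of 3.0 bits corresponds to an emission probability of 0.4 (0.05 * 2**3), which is the value the hmmsearch parser later in this commit uses as its high-score cut-off.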
class hmmer_hit():
def __init__(self):
self.hmm_name = ""
self.seq_name = ""
self.typ = '?'
self.score = 0.0 # HMMER's bit score
self.mcscore = 0.0 # sum of scores in well-aligned positions (PP[i] \in {8,9,*}) with a match between the sequence and the hmm consensus
self.wcscore = 0.0 # sum of scores in well-aligned positions (PP[i] \in {8,9,*}) with score[i] >= threshold
self.matches = 0 # number of matches between the sequence and the hmm consensus
self.aligned_cols = 0 # number of aligned match states
self.well_aligned_cols = 0 # number of well-aligned match states
self.ali_len = 0 # length of alignment
self.hmm_from = 0
self.hmm_to = 0
self.ali_from = 0
self.ali_to = 0
self.env_from = 0
self.env_to = 0
self.hmm_cons = ""
self.seq_cons = ""
self.pp_cons = ""
self.wc_str = ""
self.cm_str = ""
self.match_str = ""
def canExtendWith(self,b):
if self.hmm_name != b.hmm_name or self.seq_name != b.seq_name or self.typ == '?' or b.typ == '?':
return False
Lm = b.hmm_to-self.hmm_from+1
Ls = b.ali_to-self.ali_from+1
return self.hmm_from <= b.hmm_from and self.ali_from <= b.ali_from and math.fabs(Ls-Lm)/float(min(Ls,Lm)) <= .2
# it is assumed that canExtendWith(b) holds True
def extendWith(self,b):
self.score += b.score
self.mcscore += b.mcscore
self.wcscore += b.wcscore
self.matches += b.matches
self.aligned_cols += b.aligned_cols
self.well_aligned_cols += b.well_aligned_cols
self.ali_len += b.ali_len
self.hmm_from = min(self.hmm_from,b.hmm_from)
self.hmm_to = max(self.hmm_to,b.hmm_to)
self.ali_from = min(self.ali_from,b.ali_from)
self.ali_to = max(self.ali_to,b.ali_to)
self.env_from = min(self.env_from,b.env_from)
self.env_to = max(self.env_to,b.env_to)
self.wc_str += b.wc_str
self.cm_str += b.cm_str
self.match_str += b.match_str
return self
# def __str__(self):
# return "HMM {}\nSEQ {}\nSC {}\nHMM-SEQ ALIGNMENT {}-{} {}-{}\n{}\n{}\n{}".format(
# self.hmm_name, self.seq_name, self.score,
# self.hmm_from, self.hmm_to, self.ali_from, self.ali_to,
# self.hmm_cons, self.seq_cons, self.pp )
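The geometric part of canExtendWith() can be sketched on its own. The version below is an illustration under assumptions: it omits the name and hit-type checks of the original, uses a hypothetical Dom tuple instead of hmmer_hit, and adds a guard against non-positive spans that the original does not have.

```python
from collections import namedtuple

# Hypothetical stand-in for the coordinate fields of hmmer_hit.
Dom = namedtuple('Dom', 'hmm_from hmm_to ali_from ali_to')


def can_extend(a, b, max_rel_diff=0.2):
    """True if domain b follows a collinearly and the merged spans on the
    model and on the sequence differ by at most max_rel_diff (relative)."""
    Lm = b.hmm_to - a.hmm_from + 1   # merged span on the model
    Ls = b.ali_to - a.ali_from + 1   # merged span on the sequence
    if min(Ls, Lm) <= 0:             # guard added in this sketch only
        return False
    collinear = a.hmm_from <= b.hmm_from and a.ali_from <= b.ali_from
    return collinear and abs(Ls - Lm) / min(Ls, Lm) <= max_rel_diff


if __name__ == "__main__":
    a = Dom(1, 50, 10, 60)
    print(can_extend(a, Dom(60, 120, 72, 130)))   # spans 120 vs 121
    print(can_extend(a, Dom(60, 120, 300, 360)))  # spans 120 vs 351
```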
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import sys, os, io
import argparse
import re
import gzip, pickle
from collections import defaultdict
import weblogolib as weblogo
from BitVector import BitVector
from Bio import SeqIO
from pv_utils import *
from hh_utils import *
class hhalignment(object):
def __init__(self):
self.queryName = ""
self.targetName = ""
self.matchStr = ""
self.confidenceStr = ""
self.probab = 0.0
self.evalue = 0.0
self.score = 0.0
self.aligned_cols = 0
self.identities = 0.0
self.similarity = 0.0
self.sum_probs = 0.0
self.template_neff = 0.0
self.reset_query()
self.reset_target()
def reset_query(self):
self.queryLen = 0
self.queryStart = -1
self.queryEnd = -1
self.queryConsensus = ""
self.queryReference = ""
def reset_target(self):
self.targetLen = 0
self.targetStart = -1
self.targetEnd = -1
self.targetConsensus = ""
self.targetReference = ""
def __str__(self):
return "Query={}; Start={}; End={}; Len={}; Consensus={}; Reference={}; Valid={}\n".format(self.queryName,self.queryStart,self.queryEnd,self.queryLen,len(self.queryConsensus)>0,len(self.queryReference)>0,self.is_query_valid()) \
+ "Target={}; Start={}; End={}; Len={}; Consensus={}; Reference={}; Valid={}\n".format(self.targetName,self.targetStart,self.targetEnd,self.targetLen,len(self.targetConsensus)>0,len(self.targetReference)>0,self.is_target_valid()) \
+ "Confidence={}; Valid={}".format(len(self.confidenceStr)>0,self.is_valid())
def is_query_valid(self):
return self.queryName != "" and self.queryLen > 0 and self.queryStart >= 0 and self.queryEnd <= self.queryLen and \
self.queryStart <= self.queryEnd and self.queryReference != "" and len(self.queryConsensus) == len(self.queryReference)
def is_target_valid(self):
return self.targetName != "" and self.targetLen > 0 and self.targetStart >= 0 and self.targetEnd <= self.targetLen and \
self.targetStart <= self.targetEnd and self.targetReference != "" and len(self.targetConsensus) == len(self.targetReference)
def is_valid(self):
return self.is_query_valid() and self.is_target_valid() and self.confidenceStr != "" and self.matchStr != ""
def well_aligned_regions(self):
qgaps = 0
tgaps = 0
lastend = 0
regions = []
for m in re.finditer(r"[9]+",self.confidenceStr):
beg,end = m.span()
qgaps += sum( self.queryReference[i] == '-' for i in range(lastend,beg) )
qbeg = self.queryStart + beg - qgaps
qend = qbeg + end - beg
tgaps += sum( self.targetReference[i] == '-' for i in range(lastend,beg) )
tbeg = self.targetStart + beg - tgaps
tend = tbeg + end - beg
regions.append( ((qbeg,qend),(tbeg,tend)) )
lastend = end
return regions
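well_aligned_regions() maps runs of confidence '9' from alignment columns back to ungapped query and target coordinates by subtracting the gaps seen so far. A minimal stand-alone sketch of the same bookkeeping, taking plain strings instead of an hhalignment object (the toy alignment is invented for illustration):

```python
import re


def well_aligned_regions(conf, qref, tref, qstart=0, tstart=0):
    """Return [((qbeg, qend), (tbeg, tend)), ...] as half-open intervals."""
    qgaps = tgaps = lastend = 0
    regions = []
    for m in re.finditer(r"9+", conf):
        beg, end = m.span()
        # count gaps between the previous run and this one
        qgaps += qref[lastend:beg].count('-')
        tgaps += tref[lastend:beg].count('-')
        qbeg = qstart + beg - qgaps
        tbeg = tstart + beg - tgaps
        regions.append(((qbeg, qbeg + end - beg), (tbeg, tbeg + end - beg)))
        lastend = end
    return regions


if __name__ == "__main__":
    qref = "ACD--EFGH"
    tref = "AC-WWEFGH"
    conf = "99---9999"   # gap columns carry '-' as in confidenceStr
    print(well_aligned_regions(conf, qref, tref))
```

This relies on gap columns never carrying a '9', which holds in read_alignment() because blank confidence characters are replaced with '-'.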
def read_alignment(fh):
tseq = ""
tseqIndex = 0
qseqIndex = 0
hha = hhalignment()
line = fh.readline()
while line:
line = line.rstrip()
if line.startswith("Query "): # Query initialization
if hha.is_valid():
yield hha
hha = hhalignment()
hha.reset_query()
hha.reset_target()
hha.queryName = line.split()[1]
hha.targetName = ""
elif line.startswith(">"): # Target initialization
if hha.is_valid():
yield hha
hha = hhalignment()
hha.reset_query()
hha.reset_target()
hha.targetName = line.split()[0][1:]
# read following line:
# Probab=<val> E-value=<val> Score=<val> Aligned_cols=<val> Identities=<val>% Similarity=<val> Sum_probs=<val> Template_Neff=<val>
values = [ p.split('=')[1] for p in fh.readline().split() ]
hha.probab = float(values[0])
hha.evalue = float(values[1])
hha.score = float(values[2])
hha.aligned_cols = int(values[3])
hha.identities = int(values[4][:-1]) # last character is '%', it must be removed before converting the string
hha.similarity = float(values[5])
hha.sum_probs = float(values[6])
hha.template_neff = float(values[7])
elif line.startswith("Q "): # Query alignment lines
_,qname,qbeg,qseq,qend,qlen = line.split()
if qname in ["ss_pred","ss_dssp"]:
continue
elif qname == "Consensus":
hha.queryConsensus += qseq
# following the line just read, there is the match string between query and target consensus
qseqIndex = line.index(qseq)
qseq_len = len(qseq)
match_line = fh.readline().rstrip()
match_str = match_line[qseqIndex:qseqIndex+qseq_len].ljust(qseq_len)
hha.matchStr += match_str
else:
hha.queryStart = int(qbeg)-1 if hha.queryStart < 0 else hha.queryStart # Note: hha.queryStart is negative when uninitialized; update it only when the first block of the alignment is found
hha.queryEnd = int(qend) # last value found should be the maximum one
hha.queryLen = int(qlen[1:-1]) # the value is specified between round brackets
hha.queryReference += qseq
elif line.startswith("T "): # Target alignment lines
_,tname,tbeg,tseq,tend,tlen = line.split()
if tname in ["ss_pred","ss_dssp"]:
continue
elif tname == "Consensus":
hha.targetConsensus += tseq
else:
hha.targetStart = int(tbeg)-1 if hha.targetStart < 0 else hha.targetStart # Note: hha.targetStart is negative when uninitialized; update it only when the first block of the alignment is found
hha.targetEnd = int(tend) # last value found should be the maximum one
hha.targetLen = int(tlen[1:-1]) # the value is specified between round brackets
hha.targetReference += tseq
tseqIndex = line.index(tseq)
elif line.startswith("Confidence"):
tseq_len = len(tseq)
conf_line = line[tseqIndex:tseqIndex+tseq_len].ljust(tseq_len).replace(' ','-')
hha.confidenceStr += conf_line
tseqIndex = 0
line = fh.readline()
if hha.is_valid(): # output last one
yield hha
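read_alignment() parses the per-hit summary line positionally (values[0] to values[7]). A variation that looks fields up by name is sketched below; the function name is hypothetical and the sample line uses invented values in the layout described by the comment inside read_alignment().

```python
def parse_hit_summary(line):
    """Parse 'Probab=.. E-value=.. Score=.. ...' into a typed dict."""
    fields = dict(p.split('=', 1) for p in line.split())
    return {
        'probab': float(fields['Probab']),
        'evalue': float(fields['E-value']),
        'score': float(fields['Score']),
        'aligned_cols': int(fields['Aligned_cols']),
        # the trailing '%' must be removed before conversion
        'identities': int(fields['Identities'].rstrip('%')),
        'similarity': float(fields['Similarity']),
        'sum_probs': float(fields['Sum_probs']),
        'template_neff': float(fields['Template_Neff']),
    }


if __name__ == "__main__":
    # Invented example line, not real hhsearch output.
    sample = ("Probab=99.90  E-value=1.5e-25  Score=180.20  Aligned_cols=120  "
              "Identities=35%  Similarity=0.650  Sum_probs=110.4  Template_Neff=7.500")
    print(parse_hit_summary(sample))
```

Name-based lookup fails with a KeyError when a field is missing, instead of silently assigning a shifted value.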
def main( argv = None ):
# parameter definition
parser = argparse.ArgumentParser()
parser.add_argument('--hhm-dict', dest='hhmDictFile', required=True, type=str, help='HHM dictionary .pgz file')
parser.add_argument('--hhm-list', dest='hhmListFile', help='List of HHM models to consider')
parser.add_argument('--hhr-dir', dest='hhrDir', required=True, type=str, help='directory path of profile-profile comparisons in hhr format')
parser.add_argument('--fas-dir', dest='fasDir', required=True, type=str, help='directory path of fas sequences')
parser.add_argument('--out-dir', dest='outDir', required=True, type=str, metavar="prefix", help='output directory')
args = parser.parse_args()
# parameter validation
validParameters = True
if not os.path.isfile(args.hhmDictFile):
validParameters = False
print_error('HHM dictionary file "{}" does not exist'.format(args.hhmDictFile))
if args.hhmListFile and not os.path.isfile(args.hhmListFile):
validParameters = False
print_error('HHM list file "{}" does not exist'.format(args.hhmListFile))
elif args.hhrDir is not None and not os.path.isdir(args.hhrDir):
validParameters = False
print_error('--hhr-dir \"{}\" is not a directory'.format(args.hhrDir))
elif args.fasDir is not None and not os.path.isdir(args.fasDir):
validParameters = False
print_error('--fas-dir \"{}\" is not a directory'.format(args.fasDir))
if not validParameters:
return 1
print_status("loading HHM dictionary and list")
hhmDict = {}
with gzip.open( args.hhmDictFile, 'r') as f:
hhmDict = pickle.load(f)
hhmNameDict = {}
if args.hhmListFile:
with open(args.hhmListFile,'r') as f:
for line in f:
name = [ x.strip() for x in line.split(',') ]
if len(name) > 0 and (not name[0].startswith('#')):
hhm_id = name[0]
hhm_name = name[1] if len(name) > 1 else hhm_id
hhmNameDict[hhm_id] = hhm_name
else:
hhmNameDict = { k:k for k in hhmDict }
print_status("loading HHR alignment files")
fileList = []
if args.hhrDir is not None:
hhrdir = args.hhrDir[:-1] if args.hhrDir.endswith("/") else args.hhrDir
fileList = [ '{}/{}'.format(hhrdir,f) for f in os.listdir(hhrdir) if f.endswith('.hhr') and os.path.isfile('{}/{}'.format(hhrdir,f)) ]
alnDict = defaultdict(list)
for hhrFileName in fileList:
with open(hhrFileName,'r') as hhrFile:
for hhalign in read_alignment(hhrFile):
alnDict[hhalign.queryName].append(hhalign)
for qname in alnDict:
alnDict[qname].sort( key=lambda x:x.evalue )
# with open(args.outFile, 'w') as of:
# for queryName in alnDict:
# for hha in alnDict[queryName]:
# outList = [ hha.queryName, hha.targetName, hha.probab, hha.evalue, hha.score, hha.aligned_cols, hha.identities, hha.similarity, hha.sum_probs, hha.template_neff ]
# of.write('\t'.join((str(x) for x in outList)) + '\n')
if not os.path.exists(args.outDir):
os.mkdir(args.outDir)
for queryName in alnDict:
if queryName not in hhmNameDict:
continue
hhm = hhmDict[queryName]
#queryConsensus = [ qc for i,qc in enumerate(hhm.consensus) ] # if hhm.conserved[i] ]
#queryCovered = BitVector(size=len(hhm.master_seq))
# output sequence logo restricted to conserved positions
with open( "{}/{}.fas".format( args.fasDir, queryName ), 'r' ) as fas:
fas_str = ""
for rec in SeqIO.parse(fas,'fasta'):
fas_str += ">{}\n{}\n".format(rec.id,''.join([ c for i,c in enumerate(rec.seq) if hhm.conserved[i] ]))
loptions = weblogo.LogoOptions(
logo_title="{} ({})".format(hhmNameDict[queryName],queryName),
unit_name="bits",
yaxis_label="",
#show_yaxis=False,
color_scheme=weblogo.std_color_schemes["chemistry"],
annotate=[ i+1 for i,b in enumerate(hhm.conserved) if b ],
rotate_numbers=True,
stack_aspect_ratio=3,
stack_width=weblogo.std_sizes["large"],
stacks_per_line=hhm.conserved.count_bits(),
show_fineprint=False )
ldata = weblogo.LogoData.from_seqs( weblogo.read_seq_data(io.StringIO(fas_str)) )
lformat = weblogo.LogoFormat(ldata,loptions)
with open( os.path.join(args.outDir, "{}.conserved.pdf".format(hhmNameDict[queryName])), 'wb' ) as ofh:
ofh.write( weblogo.pdf_formatter(ldata,lformat) )
with open( "{}/{}.fas".format( args.fasDir, queryName ), 'r' ) as fas:
loptions = weblogo.LogoOptions(
logo_title="{} ({})".format(hhmNameDict[queryName],queryName),
unit_name="bits",
yaxis_label="",
#show_yaxis=False,
color_scheme=weblogo.std_color_schemes["chemistry"],
#rotate_numbers=True,
stack_aspect_ratio=3,
stack_width=weblogo.std_sizes["large"],
stacks_per_line=len(hhm.conserved),
show_fineprint=False )
ldata = weblogo.LogoData.from_seqs( weblogo.read_seq_data(fas) )
lformat = weblogo.LogoFormat(ldata,loptions)
with open( os.path.join(args.outDir, "{}.full.pdf".format(hhmNameDict[queryName])), 'wb' ) as ofh:
ofh.write( weblogo.pdf_formatter(ldata,lformat) )
otherConserved = BitVector(size=len(hhm.master_seq)) # positions conserved in any other template that aligns to a given query position
consDict = { qi:[] for qi in range(hhm.size()) if hhm.conserved[qi] }
lessConsDict = { qi:[] for qi in range(hhm.size()) if hhm.conserved[qi] }
for a in alnDict[queryName]:
if a.targetName in hhmNameDict:
qi = a.queryStart
ti = a.targetStart
t_hhm = hhmDict[a.targetName]
for j,qc in enumerate(a.queryConsensus):
tc = a.targetConsensus[j]
if qc == '-': # skip gaps in query
ti += 1
continue
if tc != '-' and t_hhm.conserved[ti]:
otherConserved[qi] = True
if hhm.conserved[qi] and a.confidenceStr[j] in ['8','9'] and a.targetConsensus[j]==qc and qc.isupper(): # same conserved residue in target
consDict[qi].append(a.targetName)
elif hhm.conserved[qi] and a.confidenceStr[j] in ['8','9'] and a.matchStr[j]=='|': # TODO: admit also a.targetConsensus[j]==qc.lower() ???
lessConsDict[qi].append(a.targetName)
qi += 1
ti += (1 if tc != '-' else 0)
# output sequence logo of important positions (either conserved in the current model or in any other one that aligns against it)
with open( "{}/{}.fas".format( args.fasDir, queryName ), 'r' ) as fas:
fas_str = ""
for rec in SeqIO.parse(fas,'fasta'):
fas_str += ">{}\n{}\n".format(rec.id,''.join([ c for i,c in enumerate(rec.seq) if (hhm.conserved[i] or otherConserved[i]) ]))
ann_list = []
for i in range(len(rec.seq)):
if hhm.conserved[i]:
ann_list.append(str(i+1))
elif otherConserved[i]:
ann_list.append('')
loptions = weblogo.LogoOptions(
logo_title="{} ({})".format(hhmNameDict[queryName],queryName),
unit_name="bits",
yaxis_label="",
#show_yaxis=False,
color_scheme=weblogo.std_color_schemes["chemistry"],
annotate=ann_list,
rotate_numbers=True,
stack_aspect_ratio=3,
stack_width=weblogo.std_sizes["large"],
stacks_per_line=len(ann_list),
show_fineprint=False )
ldata = weblogo.LogoData.from_seqs( weblogo.read_seq_data(io.StringIO(fas_str)) )
lformat = weblogo.LogoFormat(ldata,loptions)
with open( os.path.join(args.outDir, "{}.any_conserved.pdf".format(hhmNameDict[queryName])), 'wb' ) as ofh:
ofh.write( weblogo.pdf_formatter(ldata,lformat) )
# output conserved positions for this model
with open( os.path.join(args.outDir, "{}.cons".format(hhmNameDict[queryName])), 'w') as ofh:
for qi in sorted(consDict):
record = [ hhmNameDict[queryName], str(qi+1), hhm.consensus[qi] ] + [ hhmNameDict[targetName] for targetName in consDict[qi] ] + [ "*"+hhmNameDict[targetName] for targetName in lessConsDict[qi] ]
ofh.write( '{}\n'.format('\t'.join(record)) )
return 0
# Check if the program is not being imported
if __name__ == "__main__":
sys.exit(main())
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import sys, errno, os
import argparse
import gzip
import re
import pickle
from hmm_utils import *
from pv_utils import *
def parseHMMER(fh,hmmDict):
hmm_name = ""
seq_name = ""
line = fh.readline()
while line:
tokens = line.split()
if len(tokens) == 0 or line.startswith("#"):
pass
elif line.startswith("Query: "):
hmm_name = tokens[1]
elif line.startswith(">> "): # target sequence section
seq_name = tokens[1]
# skip seqName domain table header
fh.readline()
fh.readline()
# process domain table
line = fh.readline()
hitTable = []
while line and not line.lstrip().startswith("== domain "):
tokens = line.split()
## domain table row:
## domNum [!?] score bias c-Evalue i-Evalue hmmfrom hmm-to .. alifrom ali-to .. envfrom env-to .. acc
if len(tokens) == 16:
#eprint('[debug] domain-table: {}'.format(line[:-1]))
hit = hmmer_hit()
hit.hmm_name = hmm_name
hit.seq_name = seq_name
hit.typ = tokens[1]
hit.score = float(tokens[2])
hit.hmm_from = int(tokens[6])
hit.hmm_to = int(tokens[7])
hit.ali_from = int(tokens[9])
hit.ali_to = int(tokens[10])
hit.env_from = int(tokens[12])
hit.env_to = int(tokens[13])
hitTable.append(hit)
line = fh.readline()
# process each domain hit (i.e., hmm-seq alignment)
while line and line.lstrip().startswith("== domain "):
tokens = line.split()
hit = hitTable[ int(tokens[2])-1 ]
hmm_cons = ""
seq_cons = ""
pp_cons = ""
while True:
line = fh.readline()
tokens = line.split()
if len(tokens) == 4 and tokens[0] == hit.hmm_name:
hmm_cons += tokens[2]
#fh.readline() # skip hmm-seq match string
match_cons_i = re.search(r"\s*\S+\s+[0-9]+",line).end()+1
match_line = fh.readline()
hit.match_str += match_line[ match_cons_i : match_cons_i+len(tokens[2]) ]
tokens = fh.readline().split()
seq_cons += tokens[2]
aln_to = int(tokens[3])
tokens = fh.readline().split()
pp_cons += tokens[0]
if aln_to == hit.ali_to:
break
hit.hmm_cons = hmm_cons
hit.seq_cons = seq_cons
hit.pp_cons = pp_cons
if hit.typ == '!': # significant hit
hmm = hmmDict[hit.hmm_name]
# compute scores
hit.aligned_cols = 0
hit.well_aligned_cols = 0
hit.matches = 0
hit.wcscore = 0.0
hit.mcscore = 0.0
hmm_i = hit.hmm_from
for i,hmm_c in enumerate(hmm_cons):
hit.ali_len += 1
if hmm_c == '.': # insertion in sequence w.r.t. the model
continue
seq_c = seq_cons[i]
if seq_c == '-': # deletion in sequence w.r.t. the model
hmm_i += 1
continue
hit.aligned_cols += 1
hit.matches += 1 if hmm_c.upper() == seq_c.upper() else 0
if seq_c in 'ACDEFGHIKLMNPQRSTVWY' and pp_cons[i] in list('89*'):
score_i = hmm.score(hmm_i,seq_c)
hit.well_aligned_cols += 1
if hmm_c.upper() == seq_c.upper():
hit.cm_str += hit.match_str[i]
hit.mcscore += score_i
if score_i >= 3.0: # high-score position
hit.wc_str += hit.match_str[i]
hit.wcscore += score_i
hmm_i += 1
yield hit
# skip lines until the next domain/sequence or the end of file
line = fh.readline()
while line and not line.lstrip().startswith("== domain ") and not line.startswith(">> ") and not line.startswith("Query: "):
line = fh.readline()
# here either EOF is reached or there are other models/sequences to process
continue
line = fh.readline()
def main( argv = None ):
# parameter definition
parser = argparse.ArgumentParser()
parser.add_argument('--hmmer-dict', dest='hmmerDictFile', type=str, required=True, help='HMMER dictionary .pgz file')
parser.add_argument('--seq-db', dest='seqFile', type=str, required=True, help='Sequence database in csv format with [name,function,family,length] fields')
args = parser.parse_args()
# parameter validation
validParameters = True
if not os.path.isfile(args.hmmerDictFile):
validParameters = False
print_error('HMMER dictionary file "{}" does not exist'.format(args.hmmerDictFile))
if not os.path.isfile(args.seqFile):
validParameters = False
print_error('Sequence database file "{}" does not exist'.format(args.seqFile))
if not validParameters:
return 1
print_status("loading HMMER dictionary...")
hmmerDict = {}
with gzip.open(args.hmmerDictFile,'rb') as f:
hmmerDict = pickle.load(f)
print_status("loading sequence database...")
seqDict = {}
with open( args.seqFile, mode='r', newline=None ) as seqFile:
for line in seqFile:
line = line.strip()
if line:
seq_name, seq_fun, seq_fam, seq_len = [ x.strip() for x in line.split(',') ]
seqDict[ seq_name ] = { 'Function':seq_fun, 'Family':seq_fam, 'Length':int(seq_len) }
print_status("processing hmmsearch output files...")
header = ['#hmm_name','hmm_len','seq_name','seq_len','seq_family','seq_func',
'bitscore', 'mean_score', 'mcscore','mean_mcs', 'wcscore','mean_wcs',
'ident','hmm_cov','hit_type']
print('\t'.join(header))
record = []
prev_hit = hmmer_hit()
for hit in parseHMMER(sys.stdin,hmmerDict):
if hit.seq_name not in seqDict:
continue
hmm = hmmerDict[hit.hmm_name]
if prev_hit.canExtendWith(hit):
hit = prev_hit.extendWith(hit)
elif len(record) > 0:
print('\t'.join(record))
record = []
hmm_len = hmm.length
hmm_reg = hit.hmm_to-hit.hmm_from+1
hmm_loh = hit.hmm_from-1 # left overhang of the hmm hit
hmm_roh = hmm_len-hit.hmm_to # right overhang of the hmm hit
hmm_cov = 100.0 * hmm_reg / hmm_len
seq_name = hit.seq_name
seq_len = seqDict[seq_name]['Length']
seq_loh = hit.env_from-1 # left overhang of the seq hit (envelope considered)
seq_roh = seq_len-hit.env_to # right overhang of the seq hit (envelope considered)
max_oh = max( 0.1 * hmm_len, 10 )
#left_oh = min(hmm_loh,seq_loh)
#right_oh = min(hmm_roh,seq_roh)
hit_type = "FULL" if (hmm_loh <= max_oh and hmm_roh <= max_oh) else ("OVER" if (hmm_loh-seq_loh > max_oh or hmm_roh-seq_roh > max_oh) else "PART")
#hit_type = "FULL" if (hmm_loh <= max_oh and hmm_roh <= max_oh) else ("OVER" if (left_oh <= max_oh and right_oh <= max_oh) else "PART")
mean_score = 100.0 * hit.score / hit.aligned_cols
mean_mcs = 100.0 * hit.mcscore / hit.aligned_cols # hit.well_aligned_cols
mean_wcs = 100.0 * hit.wcscore / hit.aligned_cols # hit.well_aligned_cols
identity = 100.0 * hit.matches / hit.ali_len
record = [ hit.hmm_name, str(hmm_len),
seq_name, str(seq_len), seqDict[seq_name]['Family'], seqDict[seq_name]['Function'],
"{:.2f}".format(hit.score), "{:.4f}".format(mean_score),
"{:.2f}".format(hit.mcscore), "{:.4f}".format(mean_mcs),
"{:.2f}".format(hit.wcscore), "{:.4f}".format(mean_wcs),
"{:.2f}".format(identity), "{:.2f}".format(hmm_cov), hit_type ]
prev_hit = hit
if len(record) > 0:
print('\t'.join([ str(r) for r in record ]))
return 0
# Check if the program is not being imported
if __name__ == "__main__":
sys.exit(main())
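The FULL/OVER/PART labelling computed in main() above depends only on hit coordinates and lengths. A minimal sketch that isolates the rule is given below; the function name and argument list are hypothetical.

```python
def classify_hit(hmm_len, hmm_from, hmm_to, seq_len, env_from, env_to):
    """Label a hit as FULL, OVER or PART using the overhang rule of main()."""
    hmm_loh = hmm_from - 1           # model positions left of the hit
    hmm_roh = hmm_len - hmm_to       # model positions right of the hit
    seq_loh = env_from - 1           # sequence residues left of the envelope
    seq_roh = seq_len - env_to       # sequence residues right of the envelope
    max_oh = max(0.1 * hmm_len, 10)  # tolerated overhang
    if hmm_loh <= max_oh and hmm_roh <= max_oh:
        return "FULL"
    if hmm_loh - seq_loh > max_oh or hmm_roh - seq_roh > max_oh:
        return "OVER"   # the model extends past an end of the sequence
    return "PART"


if __name__ == "__main__":
    # Illustrative coordinates only.
    print(classify_hit(200, 5, 195, 500, 100, 290))
    print(classify_hit(200, 80, 200, 300, 1, 121))
    print(classify_hit(200, 80, 200, 500, 200, 320))
```

An OVER hit is one where the unaligned part of the model cannot be accommodated by the sequence, which is why many OVER hits suggest a truncated sequence.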
#!/usr/bin/env python3
#
# This file is part of ProfileView.
#
# ProfileView is free software: you can redistribute it and/or modify
# it under the terms of the CeCILL 2.1 Licence
#
# ProfileView is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# You should have received a copy of the Licence CeCILL 2.1 along
# with ProfileView. If not, see <https://cecill.info/>.
#
import os, sys
import re
script_name = re.sub( r'\.py$', '', os.path.basename(sys.argv[0]) )
def uniq(l):
return list(set(l))
# stable uniq
def suniq(l):
outlist=[]
procs=set() # processed elements
for e in l:
if e not in procs:
procs.add(e)
outlist.append(e)
return outlist
def eprint(*args, **kwargs):
print(*args, file=sys.stderr, **kwargs)
def print_status(msg):
eprint("[{name}] {m}".format(name=script_name,m=msg))
def print_error(msg):
eprint("[{name}] error: {m}".format(name=script_name,m=msg))
def print_warning(msg):
eprint("[{name}] warning: {m}".format(name=script_name,m=msg))
def is_int(s):
try:
int(s)
return True
except ValueError:
return False
def peekline(f):
pos = f.tell()
line = f.readline()
f.seek(pos)
return line
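peekline() relies on tell() and seek(), so it only works on seekable streams such as regular files or io.StringIO; a piped sys.stdin is typically not seekable. A minimal usage sketch, with the helper repeated so the example is self-contained:

```python
import io


def peekline(f):
    """Return the next line without consuming it (seekable streams only)."""
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    return line


if __name__ == "__main__":
    buf = io.StringIO("first\nsecond\n")
    print(repr(peekline(buf)))    # look ahead
    print(repr(buf.readline()))   # the same line is still available
    print(repr(peekline(buf)))
```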