Commit cf8bc06b by DLA-Ranker

Updates

parent 08302f9a
name: null
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- bokeh=1.4
- cmake=3.16 # ensures that Gloo library extensions will be built
- cudnn=7.6
- cupti=10.1
- cxx-compiler=1.0 # ensures C and C++ compilers are available
- jupyterlab=1.2
- mpi4py=3.0 # installs cuda-aware openmpi
- nccl=2.5
- nodejs=13
- nvcc_linux-64=10.1 # configures environment to be "cuda-aware"
- pip=20.0
- pip:
- mxnet-cu101mkl==1.6.* # MXNET is installed prior to horovod
- -r file:requirements.txt
- python=3.7
- pytorch=1.4
- tensorboard=2.1
- tensorflow-gpu=2.1
- torchvision=0.5
Comp;ch1;ch2;Conf;Class
1AK4;A;B;1AK4_cm-it0_8585;0
1AK4;A;B;1AK4_cm-it0_9733;0
2I25;A;B;2I25_ti5-it1_46;1
# Auto detect text files and perform LF normalization
* text=auto
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
MIT License
Copyright (c) 2022 Simon Crouzet
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# InteractionGNN
Source code to run InteractionGNN on protein-protein interfaces
### Usage
1. Set the desired parameters in *config.ini* (parsed as shown in the sketch below)
2. In a bash shell, run `python interaction_gnn.py`
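A minimal sketch of how the *config.ini* values are consumed (it mirrors `read_config` in `src/config.py`; all values are read as strings and cast explicitly by `interaction_gnn.py`):
```
from configparser import ConfigParser

# Read the three sections used by interaction_gnn.py and cast a few values.
config = ConfigParser()
config.read('config.ini')
epochs = int(config['RUNPARAMS']['epochs'])             # e.g. "100" -> 100
use_kfold = config['RUNPARAMS']['use_kfold'] == 'True'  # booleans are stored as strings
data_dir = config['RUNINFO']['data_dir']
print(epochs, use_kfold, data_dir)
```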
### Requirements
##### Packages:
- Manual: install PyTorch, PyTorch Geometric, CUDA Toolkit, scikit-learn and the packages numpy, pandas, matplotlib, lz4 and tqdm (`conda install -c pytorch -c pyg -c conda-forge python=3.9 numpy pandas matplotlib tqdm pytorch pyg scikit-learn cuda-toolkit lz4`)
- All-in-one: Run `conda create --name interaction_gnn --file interaction_gnn.yml`
\
InteractionGNN uses [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric).
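A quick way to verify the installation (a minimal sketch, not part of the project: it only checks that the core packages import and whether a GPU is visible):
```
import torch
import torch_geometric
import sklearn

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torch_geometric', torch_geometric.__version__)
print('scikit-learn', sklearn.__version__)
```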
##### Data files:
Data files should be placed in the `data` folder, organized as in the following example for binary classification:
\
```
InteractionGNN
| interaction_gnn.py
|
|___src
| | ...
|
|___data
|___protein_pair_1
| |___0
| | | file1
| | | file2
| |
| |___1
| | file3
| | file4
|
|___protein_pair_2
| |___0
| | | file5
| | | file6
| |
| |___1
| | file7
| | file8
..........
```
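Each data file is a headerless CSV whose first three columns are x,y,z coordinates and whose remaining columns are node features (this is how `ProteinDataset.get()` in `src/data.py` reads it). The sketch below shows how such a file becomes a graph, assuming purely numeric features and an illustrative file path; the real loader additionally one-hot encodes the S/C/R region column:
```
import numpy as np
import pandas as pd
import torch
from torch_geometric.data import Data

# Load one sample: columns 0-2 are coordinates, the rest are node features.
df = pd.read_csv('data/protein_pair_1/0/file1', sep=',', header=None)
coords = df.iloc[:, :3].to_numpy(dtype=np.float32)
features = df.iloc[:, 3:].to_numpy(dtype=np.float32)

# Connect every pair of distinct nodes closer than the distance threshold.
threshold = 10.0  # default distance threshold in ProteinDataset
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
src, dst = np.nonzero((dist <= threshold) & ~np.eye(len(df), dtype=bool))

graph = Data(x=torch.tensor(features),
             edge_index=torch.tensor(np.stack([src, dst]), dtype=torch.long),
             y=0)  # class label of this sample (0 or 1)
print(graph)
```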
### Citing
If you use this code, please cite the associated paper:
\
```Y. Mohseni Behbahani, S. Crouzet, E. Laine, A. Carbone, *Deep Local Analysis evaluates protein docking conformations with locally oriented cubes*```
[RUNINFO]
save_models = True
data_dir = ./data/intermediate_m5_e30_nonorm_split1/
[MODELINFO]
nn_model_string = GCN
learning_rate = 0.001
nb_negative = -1
nb_positive = -1
nb_features = 22
specific_scr_fea = S,C,R
balance_test_split = True
exclude_last = True
[RUNPARAMS]
use_kfold = True
epochs = 100
batch_size = 500
nb_folds = 5
use_weights = True
import copy
import os
import numpy as np
import random
import torch
import matplotlib.pyplot as plt
import pandas as pd
import torch_geometric.data as geom_data
import torch_geometric.loader as geom_loader
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Linear, Dropout
from src.data import ProteinDataset, StratifiedSplit, kFoldStratifiedCrossValidation, reduce_split
from torch_geometric.data import Data, Dataset, Batch, DataLoader
from src.model import GCN, GAT
from src.utils import as_numpy, as_numpy_tuple, plot_roc_pr_curves
from src.metrics import ScoreDataframe
from datetime import datetime
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
from src.config import str_to_bool, str_to_model, read_config
from sklearn.utils.class_weight import compute_class_weight
print('================= InteractionGNN =================')
#Read config.ini file
runinfo, modelinfo, runparams = read_config('config.ini')
# parameters
start_time = datetime.now()
# Configuration, from the config.ini file
NN_model_string = modelinfo["NN_model_string"]
NN_model = str_to_model(NN_model_string) # GCN or GAT or Linear
save_models = str_to_bool(runinfo["save_models"])
epochs = int(runparams["epochs"])
batch_size = int(runparams["batch_size"])
nb_folds = int(runparams["nb_folds"])
model_no = runinfo["model_no"] # id for results
data_dir = runinfo["data_dir"] # data directory
nb_negative = int(modelinfo["nb_negative"]) # -1 means 'No limit for negative samples'
nb_positive = int(modelinfo["nb_positive"]) # -1 means 'No limit for positive samples'
use_kfold = str_to_bool(runparams["use_kfold"])
nb_features = int(modelinfo["nb_features"])
use_weights = str_to_bool(runparams["use_weights"])
# Specific SCR regions should be of format 'S,C,R' in the config file
specific_scr_fea = modelinfo['specific_scr_fea'].split(',')
balance_test_split = str_to_bool(modelinfo["balance_test_split"])
exclude_last = str_to_bool(modelinfo["exclude_last"])
data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), data_dir))
if not os.path.exists('results'):
os.mkdir('results')
print('Initializing dataset from {}'.format(data_dir))
seed = 123 # Seed to fix the simulation
random.seed(seed)
lr = float(modelinfo["learning_rate"]) # Learning rate for optimizer
# normalize train -> remember mean, std to use for val and test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Build dataset
dataset = ProteinDataset(data_dir, random_seed=seed, device=device, nb_features=nb_features, use_scr=True, specific_scr_fea=specific_scr_fea)
# Reduce dataset to the desired number of samples (values come from config.ini; -1 means no limit)
dataset.reduce(nb_negative=nb_negative, nb_positive=nb_positive)
print('Dataset built with {} samples - {} features and {} classes'.format(dataset.__len__(), dataset.nb_features, dataset.nb_classes))
if dataset.nb_classes == 2:
print('\tDataset contains {} P and {} N'.format(dataset.nb_samples[1], dataset.nb_samples[0]))
print('Loading Data...')
# TODO: Test parallelism to speed up data loading
# Creating sets
if use_kfold:
print('\tCreating train/val/test sets with k-fold cross validation...')
dataset_train, dataset_test = StratifiedSplit(dataset, first_partition=0.85, second_partition=0.15, shuffle=True)
dataset_val = []
train_len = (len(dataset_train)*(nb_folds -1)) / nb_folds
val_len = (len(dataset_train)) / nb_folds
test_len = len(dataset_test)
# Create folds
kFoldStrCV = kFoldStratifiedCrossValidation(dataset_train, nb_folds=nb_folds, nb_classes=2, shuffle=True)
iter_kFold = iter(kFoldStrCV)
range_fold = range(nb_folds)
else:
print('\tCreating train/val/test sets (without CV)...')
dataset_train, dataset_test_val = StratifiedSplit(dataset, first_partition=0.7, second_partition=0.3, shuffle=True)
dataset_test, dataset_val = StratifiedSplit(dataset_test_val, first_partition=0.5, second_partition=0.5, shuffle=True, in_memory=True, nb_classes=dataset.nb_classes)
del dataset_test_val # Unused heavy data
train_len = len(dataset_train)
test_len = len(dataset_test)
val_len = len(dataset_val)
range_fold = range(1)
assert(len(dataset_train)+len(dataset_val) > len(dataset_test)), 'Test set is surprisingly large!'
# Balancing test split
if balance_test_split:
print('\tBalancing test set...')
dataset_test = reduce_split(dataset_test, limits={0:100, 1:100}, nb_classes=dataset.nb_classes)
test_len = len(dataset_test)
dim = dataset_train[0].x.shape[1]
# Save pairs used in different sets
max_length = len(dataset_train)+len(dataset_val)
split_df = pd.DataFrame({'all_train':[d.pair for d in dataset_train+dataset_val],'test':[dataset_test[i].pair if i < len(dataset_test) else '' for i in range(max_length)]})
print('Splits created with {} train/{} val/{} test samples'.format(train_len, val_len, test_len))
# create model
# Set neural network architecture - can be swapped on the fly
print('Instantiating Graph Neural Network architecture...')
kwargs = {}
base_model = NN_model(**kwargs, dim=dim)
models = [copy.deepcopy(base_model).to(device) for _ in range_fold]
print('\tModel {} instantiated on device {}: {}'.format(NN_model, device, models[0]))
#model.init_weights(torch.nn.init.xavier_normal_, 0.1)
# main loop
hist = {"train_loss": np.array([0.0 for e in range(epochs)]), "val_loss": np.array([0.0 for e in range(epochs)]), "val_acc": np.array([0.0 for e in range(epochs)])}
scores = ScoreDataframe()
metrics_val = {'fprs':[], 'tprs':[], 'precisions':[], 'recalls':[], 'aucs':np.zeros(nb_folds)}
for k in range_fold:
print('=========')
print(f'Fold {k + 1}:')
print('=========')
print()
optimizer = torch.optim.Adam(models[k].parameters(), lr=lr, weight_decay=5e-4) # Define optimizer.
# Preparing test and val sets
if use_kfold:
train_set, val_set = iter_kFold.__next__()
else:
train_set, val_set = dataset_train, dataset_val
# Compute weights for each class. Since validation loss is only informative, use weights from train set only.
if use_weights:
weights = torch.Tensor(compute_class_weight('balanced', classes=[0,1], y=[d.y for d in train_set])).to(device)
else:
weights = torch.Tensor(compute_class_weight(None, classes=[0,1], y=[d.y for d in train_set])).to(device)
split_df.insert(len(split_df.columns), 'train_fold{}'.format(k), [train_set[i].pair if i < len(train_set) else '' for i in range(max_length)])
split_df.insert(len(split_df.columns), 'val_fold{}'.format(k), [val_set[i].pair if i < len(val_set) else '' for i in range(max_length)])
# Divide sets into batches
graph_train_loader = geom_loader.DataLoader(train_set, batch_size=batch_size, shuffle=True)
graph_val_loader = geom_loader.DataLoader(val_set, batch_size=batch_size, shuffle=True) # Additional loader if you want to change to a larger dataset
# Running model
for epoch in range(epochs):
print('---------')
print(f'epoch {epoch + 1}:')
train_loss = 0.0
val_loss = 0.0
val_acc = 0.0
outputs, label = np.empty((0,dataset.nb_classes), dtype=np.float32), np.empty((0), dtype=np.int64) # for ROC PR plots of val dataset
# Train
for step, data in enumerate(graph_train_loader):
# Moving data to GPU memory
data = data.to(device)
# Forward pass
train_pred = as_numpy_tuple(models[k].trainGraph(data, optimizer, device, weights=weights))
train_loss += train_pred.loss
#Move back data to CPU and flush GPU memory
if device.type != 'cpu':
torch.cuda.empty_cache()
# if save_models:
# scores.add_rows(conformation_names=data.pair, folds=[k for i in range(len(train_pred.gt))], scores=train_pred.pred[:, 1], time_of_predictions=[np.NaN for i in range(len(train_pred.gt))], targets=train_pred.gt, epoch=epoch)
# Validation
for step, data in enumerate(graph_val_loader):
# Moving data to GPU memory
data = data.to(device)
# Forward pass
val_pred = as_numpy_tuple(models[k].predictGraph(data, device))
val_loss += val_pred.loss
val_acc += val_pred.correct
if save_models:
scores.add_rows(conformation_names=data.pair, folds=[k for i in range(len(val_pred.gt))], scores=val_pred.pred[:, 1], time_of_predictions=[np.NaN for i in range(len(val_pred.gt))], targets=val_pred.gt, epoch=epoch)
# data for plot # TODO: Fix val plot for batch size > 1
outputs = np.concatenate((outputs, val_pred.pred), axis=0)
label = np.concatenate((label, val_pred.gt), axis=0)
# Move back data to CPU and flush GPU memory
if device.type != "cpu":
torch.cuda.empty_cache()
# Small bias here: losses are divided by the number of batches, so the last batch (if smaller than batch_size) carries more weight than the others.
train_loss /= len(graph_train_loader)
val_loss /= len(graph_val_loader)
val_acc = val_acc / len(val_set)
print('Training loss: ', train_loss)
print('Val loss: ', val_loss)
print('Val accuracy: %4.2f%%' % (100.0*float(val_acc)))
print('---------')
hist['train_loss'][epoch] += train_loss
hist['val_loss'][epoch] += val_loss
hist['val_acc'][epoch] += val_acc
# save model
# if save_models:
# torch.save(models[k], 'results/' + NN_model_string + '_' + model_no + '_model_epoch_' + str(epoch+1) + '.pt') # , PATH
if dataset.nb_classes == 2:
fpr, tpr, _ = roc_curve(y_true=label, y_score=outputs[:,1])
auc = roc_auc_score(y_true=label, y_score=outputs[:,1])
precision, recall, _ = precision_recall_curve(y_true=label, probas_pred=outputs[:,1]) # third return value is the thresholds
metrics_val['fprs'].append(fpr)
metrics_val['tprs'].append(tpr)
metrics_val['aucs'][k] = auc
metrics_val['precisions'].append(precision)
metrics_val['recalls'].append(recall)
plot_roc_pr_curves(metrics_val, 'results/roc_pr_val_curve_' + NN_model_string+ '_' + model_no + '.svg', plot_roc_lines=True)
if use_kfold:
hist['train_loss'] = np.divide(hist['train_loss'], nb_folds)
hist['val_loss'] = np.divide(hist['val_loss'], nb_folds)
hist['val_acc'] = np.divide(hist['val_acc'], nb_folds)
# test NN using test dataset
print('--------------------------')
print('Testing on test dataset...')
# Divide set into batches
graph_test_loader = geom_loader.DataLoader(dataset_test, batch_size=1)
test_loss = 0
test_acc = 0
metrics_test = {'base_fpr':np.linspace(0, 1, len(dataset_test)), 'tprs':np.zeros((nb_folds, len(dataset_test))), 'precisions':[], 'recalls':[], 'aucs':np.zeros(nb_folds)}
for k in range_fold:
outputs, label = np.empty((0,dataset.nb_classes), dtype=np.float32), np.empty((0), dtype=np.int64) # for ROC PR plots of val dataset
for step, data in enumerate(graph_test_loader):
# Moving data to GPU memory
data = data.to(device)
# Forward pass
test_pred = as_numpy_tuple(models[k].predictGraph(data, device))
test_loss += test_pred.loss
test_acc += test_pred.correct
outputs = np.concatenate((outputs, test_pred.pred), axis=0)
label = np.concatenate((label, test_pred.gt), axis=0)
if save_models:
scores.add_rows(conformation_names=data.pair, folds=[-1 for i in range(len(test_pred.gt))], scores=test_pred.pred[:, 1], time_of_predictions=[np.NaN for i in range(len(test_pred.gt))], targets=test_pred.gt, epoch=-1)
# Move back data to CPU and flush GPU memory
if device.type != "cpu":
torch.cuda.empty_cache()
if dataset.nb_classes == 2:
fpr, tpr, _ = roc_curve(y_true=label, y_score=outputs[:,1])
auc = roc_auc_score(y_true=label, y_score=outputs[:,1])
precision, recall, _ = precision_recall_curve(y_true=label, probas_pred=outputs[:,1]) # third return value is the thresholds
interp_tpr = np.interp(metrics_test['base_fpr'], fpr, tpr)
interp_tpr[0] = 0.0
metrics_test['tprs'][k] = interp_tpr
metrics_test['aucs'][k] = auc
metrics_test['precisions'].append(precision)
metrics_test['recalls'].append(recall)
plot_roc_pr_curves(metrics_test, 'results/roc_pr_test_curve_' + NN_model_string+ '_' + model_no + '.svg')
test_loss /= (len(graph_test_loader) * nb_folds)
test_acc /= (len(dataset_test) * nb_folds)
print('Test loss: ', test_loss)
print('Test accuracy: %4.2f%%' % (100.0*float(test_acc)))
print('--------------------------')
# plot figure
fig, ax = plt.subplots(1,1)
ax.plot([e for e in range(0, epochs)], hist["train_loss"], label="train_loss")
ax.plot([e for e in range(0, epochs)], hist["val_loss"], label="val_loss")
ax.plot([e for e in range(0, epochs)], hist["val_acc"], label="val_acc")
#ax.set_xticks(np.arange(0, epochs, 1.0))
#ax.set_xticklabels(np.arange(1, epochs+1, 1))
ax.set_xlabel("epoch")
ax.legend()
fig.savefig('results/' + NN_model_string + '_' + model_no + '.svg')
# info of run
split_df.to_csv('results/splits_' + NN_model_string+ '_' + model_no + '.csv', mode='a', header=True)
scores.export('results/prediction_results_' + NN_model_string + '_' + model_no +'.csv')
f= open("results/" + NN_model_string + "_" + model_no + ".txt", "w+")
f.write("Data from " + data_dir + "\n")
f.write("NN_model: " + NN_model_string + "\n")
f.write("model: " + str(models[0]) + "\n")
f.write("epochs: " + str(epochs) + "\n")
f.write("batch size: " + str(batch_size) + "\n")
end_time = datetime.now()
f.write("duration: " + str(end_time-start_time) + "\n")
f.write("size train dataset: " + str(train_len) + "\n")
f.write("size validation dataset: " + str(val_len) + "\n")
f.write("size test dataset: " + str(test_len) + "\n")
# class proportions
f.close()
# time
print('Duration: {}'.format(end_time - start_time))
name: interaction_gnn
channels:
- pytorch
- pyg
- nvidia
- bioconda
- conda-forge
- defaults
dependencies:
- blas=2.113=mkl
- blas-devel=3.9.0=13_win64_mkl
- brotli=1.0.9=h8ffe710_6
- brotli-bin=1.0.9=h8ffe710_6
- brotlipy=0.7.0=py39hb82d6ee_1003
- bzip2=1.0.8=h8ffe710_4
- ca-certificates=2021.10.8=h5b45459_0
- certifi=2021.10.8=py39hcbf5309_2
- cffi=1.15.0=py39h0878f49_0
- charset-normalizer=2.0.12=pyhd8ed1ab_0
- colorama=0.4.4=pyh9f0ad1d_0
- cryptography=36.0.2=py39h7bc7c5c_0
- cuda-cccl=11.6.55=hd268e57_0
- cuda-command-line-tools=11.6.2=h65bbf44_0
- cuda-compiler=11.6.2=h65bbf44_0
- cuda-cudart=11.6.55=h5fb1900_0
- cuda-cuobjdump=11.6.124=h8654613_0
- cuda-cupti=11.6.124=h532822a_0
- cuda-cuxxfilt=11.6.124=h3f9c74b_0
- cuda-libraries=11.6.2=h65bbf44_0
- cuda-libraries-dev=11.6.2=h65bbf44_0
- cuda-memcheck=11.6.124=hea6bc18_0
- cuda-nvcc=11.6.124=h769bc0d_0
- cuda-nvdisasm=11.6.124=he05ff55_0
- cuda-nvml-dev=11.6.55=h2bb381e_0
- cuda-nvprof=11.6.124=he581663_0
- cuda-nvprune=11.6.124=hb892de1_0
- cuda-nvrtc=11.6.124=h231bd66_0
- cuda-nvrtc-dev=11.6.124=hd7d06dc_0
- cuda-nvtx=11.6.124=hee9d5a4_0
- cuda-nvvp=11.6.124=h6a974fa_0
- cuda-sanitizer-api=11.6.124=ha4888a7_0
- cuda-toolkit=11.6.2=h65bbf44_0
- cuda-tools=11.6.2=h65bbf44_0
- cuda-visual-tools=11.6.2=h65bbf44_0
- cudatoolkit=11.5.0=hfde6d1b_9
- cycler=0.11.0=pyhd8ed1ab_0
- fonttools=4.25.0=pyhd3eb1b0_0
- freetype=2.10.4=h546665d_1
- icu=69.1=h0e60522_0
- idna=3.3=pyhd8ed1ab_0
- intel-openmp=2022.0.0=h57928b3_3663
- jbig=2.1=h8d14728_2003
- jinja2=3.1.1=pyhd8ed1ab_0
- joblib=1.1.0=pyhd8ed1ab_0
- jpeg=9e=h8ffe710_0
- kiwisolver=1.4.2=py39h2e07f2f_0
- lcms2=2.12=h2a16943_0
- lerc=3.0=h0e60522_0
- libblas=3.9.0=13_win64_mkl
- libbrotlicommon=1.0.9=h8ffe710_6
- libbrotlidec=1.0.9=h8ffe710_6
- libbrotlienc=1.0.9=h8ffe710_6
- libcblas=3.9.0=13_win64_mkl
- libclang=13.0.1=default_h81446c8_0
- libcublas=11.9.2.110=hb6c9e1a_0
- libcublas-dev=11.9.2.110=h1662081_0
- libcufft=10.7.2.124=hfa55d67_0
- libcufft-dev=10.7.2.124=hfb516c1_0
- libcurand=10.2.9.124=h99c9f72_0
- libcurand-dev=10.2.9.124=he731d31_0
- libcusolver=11.3.4.124=hc34ff3b_0
- libcusolver-dev=11.3.4.124=h4c0dfd9_0
- libcusparse=11.7.2.124=hfe5ea2b_0
- libcusparse-dev=11.7.2.124=h39db74a_0
- libdeflate=1.10=h8ffe710_0
- libffi=3.4.2=h8ffe710_5
- liblapack=3.9.0=13_win64_mkl
- liblapacke=3.9.0=13_win64_mkl
- libnpp=11.6.3.124=h516bb01_0
- libnpp-dev=11.6.3.124=hc355075_0
- libnvjpeg=11.6.2.124=hf97cc0b_0
- libnvjpeg-dev=11.6.2.124=h6c8d1d7_0
- libpng=1.6.37=h1d00b33_2
- libtiff=4.3.0=hc4061b1_3
- libuv=1.43.0=h8ffe710_0
- libwebp=1.2.2=h57928b3_0
- libwebp-base=1.2.2=h8ffe710_1
- libxcb=1.13=hcd874cb_1004
- libzlib=1.2.11=h8ffe710_1014
- lz4=3.1.10=py39h92e281b_0
- lz4-c=1.9.3=h8ffe710_1
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=2.1.1=py39hb82d6ee_1
- matplotlib=3.5.1=py39hcbf5309_0
- matplotlib-base=3.5.1=py39h581301d_0
- mkl=2022.0.0=h0e2418a_796
- mkl-devel=2022.0.0=h57928b3_797
- mkl-include=2022.0.0=h0e2418a_796
- msys2-conda-epoch=20160418=1
- munkres=1.0.7=py_1
- networkx=2.7.1=pyhd8ed1ab_1
- numpy=1.22.3=py39h6331f09_0
- openjpeg=2.4.0=hb211442_1
- openssl=1.1.1n=h8ffe710_0
- packaging=21.3=pyhd8ed1ab_0
- pandas=1.4.1=py39h2e25243_0
- pillow=9.0.1=py39ha53f419_2
- pip=22.0.4=pyhd8ed1ab_0
- pthread-stubs=0.4=hcd874cb_1001
- pycparser=2.21=pyhd8ed1ab_0
- pyg=2.0.4=py39_torch_1.11.0_cu115
- pyopenssl=22.0.0=pyhd8ed1ab_0
- pyparsing=3.0.7=pyhd8ed1ab_0
- pyqt=5.12.3=py39hb0d2dfa_4
- pysocks=1.7.1=py39hcbf5309_4
- python=3.9.12=h9a09f29_1_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-louvain=0.15=pyhd8ed1ab_1
- python_abi=3.9=2_cp39
- pytorch=1.11.0=py3.9_cuda11.5_cudnn8_0
- pytorch-cluster=1.6.0=py39_torch_1.11.0_cu115
- pytorch-mutex=1.0=cuda
- pytorch-scatter=2.0.9=py39_torch_1.11.0_cu115
- pytorch-sparse=0.6.13=py39_torch_1.11.0_cu115
- pytorch-spline-conv=1.2.1=py39_torch_1.11.0_cu115
- pytz=2022.1=pyhd8ed1ab_0
- pyyaml=6.0=py39hb82d6ee_4
- qt=5.12.9=h556501e_6
- requests=2.27.1=pyhd8ed1ab_0
- scikit-learn=1.0.2=py39he931e04_0
- scipy=1.8.0=py39hc0c34ad_1
- setuptools=61.3.0=py39hcbf5309_0
- six=1.16.0=pyh6c4a22f_0
- sqlite=3.37.1=h8ffe710_0
- tbb=2021.5.0=h2d74725_0
- threadpoolctl=3.1.0=pyh8a188c0_0
- tk=8.6.12=h8ffe710_0
- tornado=6.1=py39hb82d6ee_3
- tqdm=4.63.1=pyhd8ed1ab_0
- typing_extensions=4.1.1=pyha770c72_0
- tzdata=2022a=h191b570_0
- ucrt=10.0.20348.0=h57928b3_0
- urllib3=1.26.9=pyhd8ed1ab_0
- vc=14.2=hb210afc_6
- vs2015_runtime=14.29.30037=h902a5da_6
- wheel=0.37.1=pyhd8ed1ab_0
- win_inet_pton=1.1.0=py39hcbf5309_4
- xorg-libxau=1.0.9=hcd874cb_0
- xorg-libxdmcp=1.1.3=hcd874cb_0
- xz=5.2.5=h62dcd97_1
- yacs=0.1.8=pyhd8ed1ab_0
- yaml=0.2.5=h8ffe710_2
- zlib=1.2.11=h8ffe710_1014
- zstd=1.5.2=h6255e5f_0
- pip:
- pyqt5-sip==4.19.18
- pyqtchart==5.12
- pyqtwebengine==5.12.1
prefix: C:\ProgramData\Anaconda3\envs\interaction_gnn
from src.model import GAT, GCN, LinearNetwork
from configparser import ConfigParser, Error
from datetime import datetime
def read_config(path):
config_object = ConfigParser()
config_object.read(path)
# datetime object containing current date and time
now = datetime.now()
dt_string = now.strftime("%d_%m_%Y_%Hh%Mm%Ss")
model_no = "run_" + str(dt_string)
runinfo = config_object['RUNINFO']
modelinfo = config_object['MODELINFO']
runparams = config_object['RUNPARAMS']
runinfo['model_no'] = model_no
return runinfo, modelinfo, runparams
# functions to read config file
def str_to_bool(s):
if s == 'True':
return True
elif s == 'False':
return False
else:
raise ValueError('str_to_bool: {} is not a valid boolean string'.format(s))
def str_to_model(s):
if s == "GCN":
return GCN
elif s == "GAT":
return GAT
elif s == "Linear":
return LinearNetwork
else:
raise Error("Model error: {} not recognized as a model".format(s))
import os, functools, math, csv, random, sys
import json as json
import warnings
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import torch.utils.data
import torch
from torch_geometric.data import Data, Dataset, Batch, DataLoader
from src.utils import load_with_compression, euclidian_distance
import logging
from src.dataset_utils import *
#basic logging config
logging.basicConfig(filename="data.log",level=logging.DEBUG)
"""
Dataset class to handle files from protein interfaces.
Data are not loaded at construction time; each sample is loaded on demand via get(index).
Returns:
Dataset: A dataset containing protein data files
"""
class ProteinDataset(Dataset):
def __init__(self, data_dir, threshold=10.0, random_seed=123,
device=torch.device('cpu'), nb_features=None, nb_classes=2, exclude_last=False, use_scr=True, specific_scr_fea=None):
super().__init__()
# Set device
self.device = device
# Set data directory
assert os.path.isdir(data_dir), '{} does not exist!'.format(data_dir)
self.data_dir = data_dir
# Set number of features (from the sample) and number of classes
self.nb_features = nb_features
self.nb_classes = nb_classes
# Param to exclude last feature of samples - for example if the sample files already contain scores from another algorithm
self.exclude_last = exclude_last
# Set specific SCR regions to use - otherwise, use all
if specific_scr_fea is not None:
if type(specific_scr_fea) is not list:
specific_scr_fea = [specific_scr_fea]
for scr_fea in specific_scr_fea:
if scr_fea not in ['R', 'C', 'S']:
raise ValueError('{} is not a valid SCR feature - must be \'R\', \'C\' or \'S\''.format(scr_fea))
self.specific_scr_fea = specific_scr_fea
self.process_samples()
random.seed(random_seed)
self.dist_calc = euclidian_distance
self.threshold = threshold
self.use_scr = use_scr
def process_samples(self, samples_select=None):
# Process samples by splitting them by class
self.pairs = {c:[] for c in range(self.nb_classes)}
for p in os.listdir(self.data_dir):
if os.path.isdir(os.path.join(self.data_dir, p)):
for c in range(self.nb_classes):
if os.path.isdir(os.path.join(self.data_dir, p, str(c))):
# Store a tuple (protein pair, nb of samples) for each class
self.pairs[c].append((p,len(os.listdir(os.path.join(self.data_dir, p, '{}'.format(c))))))
# If samples were selected, keep only those
if samples_select is not None:
self.pairs = {c:[] for c in range(self.nb_classes)}
for s_idx in range(len(samples_select)):
for p,n in samples_select[s_idx].items():
# Store a tuple (protein pair, nb of samples) for each class
self.pairs[s_idx].append((p,n))
# Process final samples with name of files containing them
self.samples = {c:{} for c in range(self.nb_classes)}
self.nb_samples = {c:0 for c in range(self.nb_classes)}
self.samples_list = []
for c in self.pairs.keys():
for p,n in self.pairs[c]:
# Recreate name of files from protein pair and class
samples_p = [f for f in os.listdir(os.path.join(self.data_dir, p, '{}'.format(c)))]
# Set limit (selected samples, or all)
limit = min(n, len(samples_p))
for i in range(limit):
# Store dict of samples, per class and per protein pair
self.samples[c][p] = samples_p[i]
# Store count of sample per class
self.nb_samples[c] += 1
# Store list of samples with the filename, the class and the protein pair
self.samples_list.append((samples_p[i],c,p))
def reduce(self, nb_negative, nb_positive=-1):
# Reduce the dataset to a specific number of positive/negative samples
# Only for binary classification!
if self.nb_classes != 2:
raise NotImplementedError('ProteinDataset: reduce() is not implemented for non-binary classification')
if nb_negative == -1 and nb_positive == -1:
return
if nb_positive == -1 or nb_positive > self.nb_samples[1]:
if nb_positive > self.nb_samples[1]:
warnings.warn('Reducing: Not enough positive samples in the dataset')
nb_positive = self.nb_samples[1]
if nb_negative == -1 or nb_negative > self.nb_samples[0]:
if nb_negative > self.nb_samples[0]:
warnings.warn('Reducing: Not enough negative samples in the dataset')
nb_negative = self.nb_samples[0]
count_samples = {0:0, 1:0}
# Keep only the first nb_positive/nb_negative samples
pairs_neg_select = {p:0 for p,n in self.pairs[0]}
pairs_pos_select = {p:0 for p,n in self.pairs[1]}
# Select negative samples by protein pair
for p, nb_s in self.pairs[0]:
if count_samples[0] < nb_negative:
if count_samples[0] + nb_s <= nb_negative:
pairs_neg_select[p] = nb_s
else:
pairs_neg_select[p] = nb_negative - count_samples[0]
count_samples[0] += pairs_neg_select[p]
# Select positive samples by protein pair
for p, nb_s in self.pairs[1]:
if count_samples[1] < nb_positive:
if count_samples[1] + nb_s <= nb_positive:
pairs_pos_select[p] = nb_s
else:
pairs_pos_select[p] = nb_positive - count_samples[1]
count_samples[1] += pairs_pos_select[p]
# Process samples again with the new selection
self.process_samples(samples_select=(pairs_neg_select, pairs_pos_select))
def __len__(self):
return sum([v for v in self.nb_samples.values()])
def one_hot_rcs(self, df, col_idx):
# Encode the RCS features in one-hot encoding
if self.use_scr:
unique_fea = pd.array(['R', 'C', 'S'])
for u in unique_fea:
df[u] = df.apply(lambda row: int(row[col_idx] == u), axis=1)
# Remove the original column
df = df.drop(col_idx, axis=1)
return df
def __getitem__(self, idxs):
if type(idxs) is int:
# One sample requested
return self.get(idxs)
elif isinstance(idxs, (list, tuple)):
# Multiple samples requested
return [self.get(idx) for idx in idxs if self.get(idx) is not None]
else:
raise ValueError('Dataset.__getitem__() can\'t return items from indexes of type ', type(idxs))
def get_idx(self, idx):
# Get a sample from its index
return self.get(idx)
def get(self, idx):
# Compute sample from the stored information
# Extract filename, class and protein pair from the index
sample, label, pair = self.samples_list[idx]
path = os.path.join(self.data_dir, pair, str(label), sample)
# Load the sample from its file
df = pd.read_csv(path, sep=',',header=None, index_col=None)
# Drop NaN columns (if any)
df.dropna(axis=1, how='all', inplace=True)
# Analyze size of the dataset
nb_rows, nb_cols = df.shape[0], df.shape[1]
nb_features = nb_cols - 3
atoms_features = []
edges_from = []
edges_to = []
if nb_features != self.nb_features:
raise ValueError('Dataset.get(): Wrong number of features in the dataset (expected {}, got {})'.format(self.nb_features, nb_features))
if self.exclude_last:
# Exclude last feature - for example if the sample files already contain scores from another algorithm
df = df.drop(nb_cols-1, axis=1)
nb_cols -= 1
if df[3].dtype == np.dtype('object'):
# Dataset contains SCR regions - encode them
if self.specific_scr_fea:
# If specific RCS regions are requested, keep only those
df = df[df[3].isin(self.specific_scr_fea)].reset_index(drop=True)
nb_rows = df.shape[0]
df = self.one_hot_rcs(df=df, col_idx=3)
else:
raise ValueError('SCR features should be in index 3')
if len(df) == 0:
warnings.warn('Dataset.get(): Empty data for sample {}'.format(sample))
return None
for i in range(nb_rows):
# For each row (residue, atom or cubes), extract the features
atoms_features.append(df.loc[i][3:].values)
# Extract neighbours and build edges
for j in range(nb_rows):
if i != j:
distance = self.dist_calc(df.loc[i][:3].values, df.loc[j][:3].values)
if distance <= self.threshold:
edges_from.append(i)
edges_to.append(j)
protein_atom_fea = torch.Tensor(np.array(atoms_features)) # Atom features
if self.nb_features is not None:
if self.use_scr:
expected_features = self.nb_features + 2 # 1 RCS feature was encoded into a vector of length 3
else:
expected_features = self.nb_features - 1 # 1 RCS feature was dropped
# Check if the features shape is correct
if (not self.exclude_last and protein_atom_fea.shape[1] != expected_features) or (self.exclude_last and protein_atom_fea.shape[1] != expected_features - 1):
logging.info('Dataset: Sample {}_{} from pair {} did not possess the number of features expected'.format(sample, label, pair))
return None
protein_edges_idx = torch.LongTensor(np.array([edges_from, edges_to])) # Edge connectivity - shape [2, num_edges]
# Class Data from torch_geometric.data, with additional attributes 'protein pair' and 'sample filename'
return Data(x=protein_atom_fea, edge_index=protein_edges_idx, y=label, idx=idx, pair=pair, sample=os.path.splitext(sample)[0])
def StratifiedSplit(dataset, first_partition=1.0, second_partition=0.0, shuffle=False, seed=random.seed(), in_memory=False, nb_classes=2):
# Custom stratified split function for the dataset
dataset_len = len(dataset)
if first_partition + second_partition != 1.0:
raise ValueError('StratifiedSplit error: Sum of split percentages must be equal to 1.0.')
data_dict = {}
# Create "class => list of data" Dict
# => dataset.get(i) is calculating and loading the data, so it is computationally heavy!
for i in tqdm(range(dataset_len), desc='Splitting data...', disable=in_memory):
if not in_memory: # Data need to be loaded - computationally heavy!
data = dataset.get(i)
else: # Data were already loaded - light
data = dataset[i]
if data is not None:
if data.pair not in data_dict:
data_dict[data.pair] = {k:[] for k in range(nb_classes)}
data_dict[data.pair][data.y].append(data)
# Build splits
first_split, second_split = [], []
np.random.seed(seed)
acc_keys = [k for k in data_dict.keys()]
split_pcount = 0
split1_limit = len(acc_keys) * first_partition
if shuffle:
random.shuffle(acc_keys) # Shuffle keys
for p in acc_keys:
# Create first split
if split_pcount < round(split1_limit) or first_partition == 1.0:
for k in range(nb_classes):
for s in data_dict[p][k]:
first_split.append(s)
split_pcount += 1
# Create second split
else:
for k in range(nb_classes):
for s in data_dict[p][k]:
second_split.append(s)
del data_dict # To free the memory used
# Shuffle again to shuffle classes within the sets
if shuffle:
random.shuffle(first_split)
random.shuffle(second_split)
return first_split, second_split
def reduce_split(split, limits={0:100, 1:100}, nb_classes=2):
# Reduce a split to a specific number of samples - e.g. to have a balanced validation set
pruned_split = []
count_reduced = {k:0 for k in range(nb_classes)}
for data in split:
if count_reduced[data.y] < limits[data.y]:
pruned_split.append(data)
count_reduced[data.y] += 1
return pruned_split
class kFoldStratifiedCrossValidation:
def __init__(self, dataset, nb_folds=5, nb_classes=2, shuffle=False, seed=random.seed()):
# Custom stratified cross-validation function for the dataset
if len(dataset) < nb_folds*2:
raise ValueError('kFold error: please use kFold with enough data in the dataset.')
self.dataset = dataset
if shuffle:
random.shuffle(dataset)
self.nb_folds = nb_folds
self.nb_classes = nb_classes
self.dataset_len = len(dataset)
np.random.seed(seed)
self.shuffle = shuffle
# Decide if the validation set has to be balanced or not
self.folds = [[] for f in range(nb_folds)]
# Build folds
self.build_folds()
def build_folds(self):
data_dict = {}
# Create "class => list of data" Dict
for i in tqdm(range(self.dataset_len), desc='Constructing kFold...'):
data = self.dataset[i]
if data.pair not in data_dict:
data_dict[data.pair] = {k:[] for k in range(self.nb_classes)}
data_dict[data.pair][data.y].append(data)
acc_keys = [k for k in data_dict.keys()]
nb_samples = len(acc_keys)
if self.shuffle:
random.shuffle(acc_keys) # Shuffle keys
nb_samples_per_fold = nb_samples / self.nb_folds
if nb_samples < self.nb_folds:
raise ValueError('kFold error: please use kFold with enough data in the dataset.')
for i in range(nb_samples):
i_fold = int(i / nb_samples_per_fold)
if i_fold > self.nb_folds -1:
raise ValueError('Unexpected bug during StratifiedSplit initialization')
# Push sample into the fold
for t,s in data_dict[acc_keys[i]].items():
self.folds[i_fold].extend(s)
# Shuffle folds
for f in self.folds:
if self.shuffle:
random.shuffle(f)
if self.shuffle:
random.shuffle(self.folds)
del data_dict # To free the memory used
def __iter__(self):
# Iterate over folds
self.val_idx = -1
return self
def __next__(self): # return (train_set, val_set)
if self.val_idx >= self.nb_folds-1:
raise StopIteration
self.val_idx = (self.val_idx + 1) % self.nb_folds
# Create train and validation sets regarding the fold index
train_set, val_set = [], []
for i_fold in range(self.nb_folds):
if i_fold == self.val_idx:
val_set.extend(self.folds[i_fold])
else:
train_set.extend(self.folds[i_fold])
return train_set, val_set
import ast, os
def transform_Dockground_name(pair, sample):
if '\uf03a' in pair:
pair1, pair2 = pair.split('\uf03a')
pair2 = pair1.split('_')[0] + '_' + pair2
sample_id = sample.split('_')[2]
return pair1 + '--' + pair2, pair1 + '--' + pair2 + '_' + sample_id
elif len(pair.split('_')) == 3:
base, pair1, pair2 = pair.split('_')
pair1 = base + '_' + pair1
pair2 = base + '_' + pair2
sample_id = sample.split('_')[3]
return pair1 + '--' + pair2, pair1 + '--' + pair2 + '_' + sample_id
else:
raise ValueError('Dockground utils: problem with pair {}'.format(pair))
def select_Dockground_split(folder_path, dataset):
groups_txt_path = os.path.join(folder_path, 'groups.txt')
with open(groups_txt_path) as f:
data = f.read()
data = data.split('=')[1].replace(' ', '').strip().upper()
split_dict = ast.literal_eval(data)
splits = [[] for f in range(len(split_dict))]
for d in dataset:
pair1, pair2 = d.pair.split('--')
for key_fold, values in enumerate(split_dict.values()): # Use enumerate() to assure presence of correct keys (i.e. from 0 to len(split_dict) - 1)
for v in values:
if v in pair1.upper() and v in pair2.upper():
splits[key_fold].append(d)
elif v not in pair1.upper() and v not in pair2.upper():
continue
else:
raise ValueError('Dockground utils: problem with pair {}'.format(d.pair))
predefined_splits = [{'train':[], 'val':[]} for f in range(len(split_dict))]
for k in range(len(split_dict)):
predefined_splits[k]['val'] = splits[k]
for fold in range(len(split_dict)):
if fold != k:
predefined_splits[k]['train'] += splits[fold]
return predefined_splits
import numpy as np
import pandas as pd
class ScoreDataframe():
# Class to extract scores and relevant metrics from a dataframe
def __init__(self):
self.df = pd.DataFrame({'conformation_name': [], 'fold': [], 'empty_cell_1': [], 'empty_cell_2': [], 'score': [], 'time_of_prediction': [], 'target': [], 'epoch': []})
self.df = self.df.astype({'conformation_name': 'str', 'fold': 'int', 'score': 'float', 'time_of_prediction': 'float', 'target': 'int', 'epoch': 'int'})
def add_row(self, conformation_name, fold, score, time_of_prediction, target, epoch):
# Add one result to the dataframe
if score < 0.0 or score > 1.0:
raise ValueError('Score must be between 0 and 1')
self.df = pd.concat([self.df, pd.DataFrame([{'conformation_name': conformation_name, 'fold': fold, 'score': score, 'time_of_prediction': time_of_prediction, 'target': target, 'epoch': epoch}])], ignore_index=True)
def add_rows(self, conformation_names, folds, scores, time_of_predictions, targets, epoch):
# Add multiple results to the dataframe
if len(conformation_names) != len(folds) or len(conformation_names) != len(scores) or len(conformation_names) != len(time_of_predictions) or len(conformation_names) != len(targets):
raise ValueError('All lists must be of the same length')
news = {'conformation_name': conformation_names, 'fold': folds, 'score': scores, 'time_of_prediction': time_of_predictions, 'target': targets, 'epoch': [epoch for _ in range(len(conformation_names))]}
new_df = pd.DataFrame(news)
self.df = pd.concat([self.df, new_df], ignore_index=True)
def export(self, savepath):
# Export the dataframe to a csv file
self.df.to_csv(savepath, header=True)
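# Example usage (a sketch, mirroring how interaction_gnn.py records predictions):
#   scores = ScoreDataframe()
#   scores.add_rows(conformation_names=['1AK4_cm-it0_8585'], folds=[0], scores=[0.7],
#                   time_of_predictions=[float('nan')], targets=[0], epoch=1)
#   scores.export('results/prediction_results.csv')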
from copy import Error
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, Sequential, global_mean_pool, BatchNorm
from torch.nn import Linear, Dropout
from torch_geometric.data import Data
from sklearn.utils.class_weight import compute_class_weight
import collections
# NamedTuple for storing the output of the model
Prediction = collections.namedtuple('Prediction', ['loss', 'pred', 'gt', 'correct'])
class GCN(torch.nn.Module):
def __init__(self,
dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
super(GCN, self).__init__()
torch.manual_seed(12345)
# self.dim_atoms = kwargs.get('dim_atoms')
# self.dim_bonds = kwargs.get('dim_bonds')
# size of input/output node feature dimension, self-loops and symmetric normalization
self.in_dim = dim
self.loss_function = loss_function
self.final_activation = final_activation
self.define_architecture(dim)
def define_architecture(self, dim):
if dim > 16:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GCNConv(dim, 100), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=100), 'x -> x'),
(Dropout(p=0.5), 'x -> x'),
(Linear(100, 20), 'x -> x'),
nn.ELU(inplace=True),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
else:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GCNConv(dim, 20), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
def reinit_gcn(self, dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
# Use it to redefine a network without restarting the pipeline
return GCN(dim, activation_function=activation_function, final_activation=final_activation, loss_function=loss_function, **kwargs)
def forward(self, data):
x, edge_index, batch = data.x, data.edge_index, data.batch
return self.model(x, edge_index, batch)
# data of one batch as input of NN -> calculate loss
def trainGraph(self, batch, optimizer, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.train()
# Clear gradients
optimizer.zero_grad()
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights) # combines log_softmax and nll_loss
# Backward pass
loss.backward() # derive gradients
optimizer.step() # update weights
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
@torch.no_grad()
def predictGraph(self, batch, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.eval()
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights)
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
class LinearNetwork(torch.nn.Module):
def __init__(self,
dim,
architecture_type=None, # Use it for AutoGL
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
super(LinearNetwork, self).__init__()
torch.manual_seed(12345)
# self.dim_atoms = kwargs.get('dim_atoms')
# self.dim_bonds = kwargs.get('dim_bonds')
# size of input/output node feature dimension, self-loops and symmetric normalization
self.in_dim = dim
self.loss_function = loss_function
if dim > 21:
self.model = Sequential('x, edge_index, batch', [
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=dim), 'x -> x'),
(Linear(dim, 20), 'x -> x'),
(nn.ELU(inplace=True), 'x -> x'),
(BatchNorm(in_channels=20), 'x -> x'),
(Linear(20, 1), 'x -> x'),
(nn.Sigmoid(), 'x -> x'),
]
)
else:
self.model = Sequential('x, edge_index, batch', [
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=dim), 'x -> x'),
(Linear(dim, 1), 'x -> x'),
(nn.Sigmoid(), 'x -> x'),
]
)
def forward(self, data):
x, edge_index, batch = data.x, data.edge_index, data.batch
return self.model(x, edge_index, batch)
# data of one batch as input of NN -> calculate loss
def trainGraph(self, batch, optimizer, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.train()
# Clear gradients
optimizer.zero_grad()
# Forward pass
out = self(batch)
target = batch.y
if out.shape[1] != 1:
raise ValueError("Expected shape of probabilities should be (N, 1)")
out = out[:, 0]
# Calculate loss
loss = F.binary_cross_entropy(out, target.float(), reduction='mean', weight=weights)
# Backward pass
loss.backward() # derive gradients
optimizer.step() # update params based on gradients
# Get final prediction
prediction = [1 if p >= 0.5 else 0 for p in out]
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
@torch.no_grad()
def predictGraph(self, batch, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.eval()
# Forward pass
out = self(batch)
target = batch.y
if out.shape[1] != 1:
raise ValueError("Expected shape of probabilities should be (N, 1)")
out = out[:, 0]
# Calculate loss
loss = F.binary_cross_entropy(out, target.float(), reduction='mean', weight=weights)
# Get final prediction
prediction = [1 if p >= 0.5 else 0 for p in out]
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
class GAT(torch.nn.Module):
def __init__(self,
dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
super(GAT, self).__init__()
torch.manual_seed(12345)
self.in_dim = dim
self.loss_function = loss_function
self.final_activation = final_activation
self.define_architecture(dim)
def define_architecture(self, dim):
if dim > 16:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GATConv(dim, 100), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=100), 'x -> x'),
(Dropout(p=0.5), 'x -> x'),
(Linear(100, 20), 'x -> x'),
nn.ELU(inplace=True),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
else:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GATConv(dim, 20), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
def reinit_gat(self, dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
# Use it to redefine a network without restarting the pipeline
return GAT(dim, activation_function=activation_function, final_activation=final_activation, loss_function=loss_function, **kwargs)
def forward(self, data):
x, edge_index, batch = data.x, data.edge_index, data.batch
return self.model(x, edge_index, batch)
def trainGraph(self, batch, optimizer, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.train()
# Clear gradients
optimizer.zero_grad() # clear gradients
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights)
# Backward pass
loss.backward() # derive gradients
optimizer.step() # update weights
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
@torch.no_grad()
def predictGraph(self, batch, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.eval()
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights)
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
import numpy as np
import torch
import json, os
import lz4.frame
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as skmetrics
from sklearn.metrics import RocCurveDisplay
import shelve
from src.model import Prediction
def as_numpy(values):
# Convert a Tensor or a list of Tensors to numpy structures
if type(values) is torch.Tensor:
return values.cpu().detach().numpy()
elif type(values) is list or type(values) is tuple:
output = [0 for i in range(len(values))]
for i in range(len(values)):
if type(values[i]) is torch.Tensor:
output[i] = values[i].cpu().detach().numpy()
else:
output[i] = values[i]
return output
else:
return values
def as_numpy_tuple(pred:Prediction):
# Pass Prediction named tuple from the model to Prediction named tuple made of numpy arrays
return Prediction(as_numpy(pred.loss), as_numpy(pred.pred), as_numpy(pred.gt), as_numpy(pred.correct))
def save_with_compression(obj, path, compression_level=lz4.frame.COMPRESSIONLEVEL_MINHC):
# Save a python object to a compressed file
with lz4.frame.open(path, mode='wb') as fp:
obj_bytes = json.dumps(obj).encode('utf-8')
compressed_obj = lz4.frame.compress(obj_bytes, compression_level=compression_level)
fp.write(compressed_obj)
def load_with_compression(path):
# Load a python object from a compressed file
with lz4.frame.open(path, mode='r') as fp:
output_compressed_data = fp.read()
obj_bytes = lz4.frame.decompress(output_compressed_data)
obj = json.loads(obj_bytes.decode('utf-8'))
return obj
def euclidian_distance(coords_from, coords_to):
# Compute the Euclidean distance between two points
if len(coords_from) != len(coords_to):
raise ValueError('Euclidian distance: coords don\'t have the same shape')
else:
return np.sqrt(np.array([np.power(i-j, 2) for i,j in zip(coords_from, coords_to)]).sum())
def plot_roc_pr_curves(metrics, savepath, plot_roc_lines=False):
# Plot meaningful ROC and PR curves
if not plot_roc_lines:
if len(np.unique(np.array([len(a) for a in metrics['tprs']]))) != 1:
raise ValueError("Error while plotting ROC curves")
# Plot ROC Curve...
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot ROC baseline
ax1.plot([0, 1], [0, 1], linestyle="--", lw=2, color="r", label="Chance", alpha=0.8)
if plot_roc_lines: # Plot each ROC curve (one per fold)
for fpr,tpr,auc in zip(metrics['fprs'], metrics['tprs'], metrics['aucs']):
ax1.plot(
fpr,
tpr,
label=r"Mean ROC (AUC = %0.2f)" % (auc),
lw=1,
alpha=0.8,
)
else:
# Plot mean ROC curve (one curve for all folds)
# Area between best and worst ROC curves are colored in grey
mean_tpr = np.mean(metrics['tprs'], axis=0)
mean_tpr[-1] = 1.0
mean_auc = skmetrics.auc(metrics['base_fpr'], mean_tpr)
std_auc = np.std(metrics['aucs'])
ax1.plot(
metrics['base_fpr'],
mean_tpr,
color="b",
label=r"Mean ROC (AUC = %0.2f $\pm$ %0.2f)" % (mean_auc, std_auc),
lw=2,
alpha=0.8,
)
std_tpr = np.std(metrics['tprs'], axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
ax1.fill_between(
metrics['base_fpr'],
tprs_lower,
tprs_upper,
color="grey",
alpha=0.2,
label=r"$\pm$ 1 std. dev.",
)
# Label, legend and set axe limits
ax1.set(
xlim=[-0.05, 1.05],
ylim=[-0.05, 1.05],
title="Receiver Operating Characteristic (ROC) Curve",
xlabel='False Positive Rate',
ylabel='True Positive Rate'
)
ax1.legend(loc="lower right")
# Plot PR Curve...
if len(metrics['precisions']) != len(metrics['recalls']):
raise ValueError("Error while computing prediction / recall")
else:
nb_folds = len(metrics['precisions'])
# Plot each PR curve (one per fold)
for p,r,i in zip(metrics['precisions'], metrics['recalls'], range(nb_folds)):
ax2.plot(
r,
p,
label="fold {}".format(i),
lw=1
)
# Label, legend and set axe limits
ax2.legend(loc="lower right")
ax2.set(
xlim=[-0.05, 1.05],
ylim=[-0.05, 1.05],
title="Precision Recall Curve",
xlabel='Recall',
ylabel='Precision'
)
# Save figure
plt.savefig(savepath)
plt.close()
def save_session(shelf_path):
# Save the current session
my_shelf = shelve.open(shelf_path,'n')
for key in globals().keys():
try:
my_shelf[key] = globals()[key]
except:
pass
my_shelf.close()
def load_session(shelf_path):
# Load a session
my_shelf = shelve.open(shelf_path)
for key in my_shelf:
try:
globals()[key]=my_shelf[key]
except:
print('Not loaded:', key)
pass
import sys, os
sys.path.insert(0, os.path.abspath(os.getcwd()))
import numpy as np
import torch
import matplotlib.pyplot as plt
import pandas as pd
import torch_geometric.data as geom_data
from src.data import ProteinDataset, StratifiedSplit, kFoldStratifiedCrossValidation, reduce_split
from torch_geometric.data import Data, Dataset, Batch, DataLoader
from src.model import GCN, GAT
from src.utils import as_numpy, as_numpy_tuple, plot_roc_pr_curves
from datetime import datetime
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_fscore_support, precision_recall_curve
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay
from configparser import ConfigParser
from src.config import str_to_bool, str_to_model, read_config
import unittest
class TestStratifiedSplit(unittest.TestCase):
def testIndependance(self):
data_dir = './data/intermediate_m5_e30_nonorm_split1/'
contains_pairs = True
skiptest = False
use_kfold = True
nb_folds = 5
seed = 123 # Seed to fix the simulation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Build dataset
dataset = ProteinDataset(data_dir, random_seed=seed, device=device, nb_features=22)
dataset.reduce(nb_negative=200, nb_positive=200)
dataset_train, dataset_test = StratifiedSplit(dataset, first_partition=0.85, second_partition=0.15, shuffle=True)
# dataset_test = reduce_split(dataset_test, nb_positive=100, nb_negative=100)
train_len = (len(dataset_train)*(nb_folds -1)) / nb_folds
val_len = (len(dataset_train)) / nb_folds
test_len = len(dataset_test)
print()
print('Training on {} proteins.'.format(train_len))
print('Validating on {} proteins.'.format(val_len))
print('Testing on {} proteins.'.format(test_len))
kFoldStrCV = kFoldStratifiedCrossValidation(dataset_train, nb_folds=nb_folds, nb_classes=2, shuffle=True)
iter_kFold = iter(kFoldStrCV)
range_fold = range(nb_folds)
# Set tests
all_train_pairs = [d.pair for d in dataset_train]
test_pairs = [d.pair for d in dataset_test]
for i_p in range(len(test_pairs)):
with self.subTest(msg='Pair {}'.format(test_pairs[i_p])):
self.assertNotIn(test_pairs[i_p], all_train_pairs)
for k in range_fold:
train_set, val_set = iter_kFold.__next__()
train_pairs = [d.pair for d in train_set]
val_pairs = [d.pair for d in val_set]
for i_p in range(len(val_pairs)):
with self.subTest(msg='Fold {} - Pair {}'.format(k, val_pairs[i_p])):
self.assertNotIn(val_pairs[i_p], train_pairs)
if __name__ == '__main__':
unittest.main()
![](Images/method5.svg.png?raw=true "DLA-Ranker")
### Contents
- [Overview](#overview)
- [Requirements](#Requirements)
- [Quick Tutorial](#Tutorial)
- [Documentation](https://deeprank.readthedocs.io/)
- [License](./LICENSE)
- [Issues & Contributing](#Issues-and-Contributing)
## Overview
![](Images/method5.svg.png?raw=true "DLA-Ranker")
Deep Local Analysis (DLA)-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface. It explicitly considers the local geometry of
the interfacial residues along with their neighboring atoms and the regions of the interface with different solvent accessibility. DLA-Ranker identifies near-native conformations and discovers alternative interfaces from ensembles generated by molecular docking.
#### Features:
- Useful APIs for fast preprocessing of large ensembles of complex conformations and their classification based on CAPRI criteria.
- Representation of an interface as a set of locally oriented cubes.
- *Atomic density map as a 3D grid.*
- *Structure class based on solvent accessibility (Support, Core, Rim).*
- *Information on the receptor and the ligand.*
- *Information on the interfacial residues.*
- Classification of docking conformations based on CAPRI criteria (Incorrect, Acceptable, Medium, High quality).
- Fast generation of cubes and evaluation of interfaces.
- Training and testing of 3D-CNN models.
- Support for various score aggregation schemes.
- *Considering only a subset of cubes for the evaluation of an interface.*
- *Residues from Support or Core regions.*
- *Residues from Core or Rim regions.*
- *Selecting residues exclusively from the receptor or from the ligand.*
- Extraction of embeddings and the topology of the interface for graph representation learning.
## Requirements
#### Packages:
DLA-Ranker can be run on Linux, macOS, and Windows. We recommend running DLA-Ranker on a machine with a GPU. It requires the following packages:
- Python version 3.7 or 3.8.
- TensorFlow version 2.2 or 2.3.
- CUDA Toolkit.
- Scikit-Learn, NumPy, pandas, Matplotlib, lz4, and tqdm (`conda install -c pytorch -c pyg -c conda-forge python=3.9 numpy pandas matplotlib tqdm pytorch pyg scikit-learn cuda-toolkit lz4`).
- All-in-one: run `conda create --name dla-ranker --file dla-ranker.yml`.
- For the requirements of InteractionGNN, please see its own README.
## Tutorial
DLA-Ranker works in two steps:
- Generating a set of locally oriented cubes representing the interface.
- Running the deep learning framework to:
- *Train: creating a new model.*
- *Test: Evaluating conformations using trained models.*
- *Encode: Extracting embeddings and the topology of the interface.*
### Generating locally oriented cubes
#### Dataset of conformations:
Place the ensemble of conformations in a directory (*e.g. 'Examples/conformations_directory'*) as shown below:
```
Example
|___conformations_list
|
|___conformations_directory
|
|___target complex 1
| | Conformation 1
| | Conformation 2
| | ...
|
|___target complex 2
| | Conformation 1
| | Conformation 2
| | ...
|
..........
```
'conformations_list' is a csv file with five columns separated by ';': the name of the target complex (`Comp`); the receptor chain ID(s) (`ch1`) and the ligand chain ID(s) (`ch2`); the name of the conformation file (`Conf`); and the class of the conformation (`Class`, 0: incorrect, 1: near-native).
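As a quick sanity check, the list can be loaded with pandas the same way the provided scripts do. A minimal sketch (adjust the path to wherever your 'conformations_list' lives):

```python
import pandas as pd

# The conformations list uses ';' as separator
conformations = pd.read_csv('Examples/conformations_list', sep=';')

# The five columns described above
expected = {'Comp', 'ch1', 'ch2', 'Conf', 'Class'}
assert expected.issubset(conformations.columns), 'conformations_list is missing required columns'
assert conformations['Class'].isin([0, 1]).all(), 'Class must be 0 (incorrect) or 1 (near-native)'

# Number of incorrect/near-native conformations per target complex
print(conformations.groupby('Comp')['Class'].value_counts())
```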
#### Processing the conformations
From directory 'Representation' run: ```python generate_cubes.py```
The output will be a directory 'map_dir' with the following structure:
```
Example
|___map_dir
|___target complex 1
| |___0
| | | conformation 1
| | | conformation 2
| |
| |___1
| | conformation 3
| | conformation 4
|
|___target complex 2
| |___0
| | | conformation 1
| | | conformation 2
| |
| |___1
| | conformation 3
| | conformation 4
..........
```
Each output represents the interface of a conformation and contains a set of local environments (*e.g. atomic density maps, structure classes (S, C, R), the topology of the interface, ...*).
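The per-conformation outputs are lz4-compressed pickles. Assuming the `lz4` command-line tool is on the PATH and that `tl.save_obj` pickles the tuple the same way as the `save_obj` helpers in the other scripts (both assumptions, based on `generate_cubes.py` shown further below), a minimal sketch of reading one interface back:

```python
import pickle
import subprocess

def load_interface(lz4_path):
    # Decompress '<name>.pkl.lz4' in place (produces '<name>.pkl' next to it)
    subprocess.check_call(['lz4', '-d', '-f', lz4_path])
    pkl_path = lz4_path.replace('.pkl.lz4', '.pkl')
    with open(pkl_path, 'rb') as f:
        # Tuple layout saved by generate_cubes.py:
        # X        - density cubes, one per interfacial residue
        # y        - per-cube class labels (0 or 1)
        # reg_type - S/C/R region of each residue
        # res_pos  - CA coordinates of each residue
        # res_name - residue names
        # res_info - (index, region, residue number, partner, name, coords) tuples
        X, y, reg_type, res_pos, res_name, res_info = pickle.load(f)
    return X, y, reg_type, res_pos, res_name, res_info
```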
### Deep learning framework
The following commands use the trained models that can be found in the directory 'Models'. This directory includes 3 sets of models:
- 'BM5': 10 models generated following a 10-fold cross-validation procedure on the 142 dimers of the Docking Benchmark version 5. The docking conformations were generated by HADDOCK.
- 'Dockground': 4 models generated following a 4-fold cross-validation procedure on the 59 target complexes of the Dockground database. The docking conformations were generated by GRAMM.
- 'CCD4PPI': 5 models generated following a 5-fold cross-validation procedure on the 400 target complexes.

For detailed information, please read the article.
#### Evaluation of interfaces
From directory 'Test' run ```python test.py```
It processes all the target complexes and their conformations and produces the csv file 'predictions_SCR'. Each row of the output file corresponds to a conformation and has 9 tab-separated columns:
- Name of the target complex and the conformation (`Conf`)
- Fold ID (`Fold`)
- Score of each residue (`Scores`)
- Region (SCR) of each residue (`Regions`)
- Global averaged score of the interface (`Score`)
- Processing time (`Time`)
- Class of the conformation (`Class`, 0: incorrect, 1: near-native)
- Partner (`RecLig`)
- Residue number (`ResNumber`, according to the PDB)

One can combine the residue numbers, regions, scores, and partner information to evaluate the interface on a subset of the interfacial residues, as sketched below.
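For instance, a minimal sketch of re-aggregating the per-residue scores with pandas, keeping only Core and Rim residues (the column names follow the list above; the file path and the choice of regions are illustrative):

```python
import pandas as pd

# Adjust the path to wherever test.py wrote 'predictions_SCR'
preds = pd.read_csv('Examples/predictions_SCR', sep='\t')

def region_average(row, keep=('C', 'R')):
    # 'Scores' and 'Regions' hold comma-separated per-residue values
    scores = [float(s) for s in str(row['Scores']).split(',')]
    regions = str(row['Regions']).split(',')
    kept = [s for s, r in zip(scores, regions) if r in keep]
    return sum(kept) / len(kept) if kept else float('nan')

# Interface score averaged over Core/Rim residues only
preds['Score_CR'] = preds.apply(region_average, axis=1)
print(preds[['Conf', 'Score', 'Score_CR', 'Class']].head())
```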
#### Extraction of the embeddings
From directory 'Test' run ```python extract_embeddings.py```
It extracts the embeddings and the topology of each given interface and writes them to an output file with the same name. Each row in a file corresponds to a residue and includes its coordinates, its region, and its embedding vector. These files can be used for aggregation of the embeddings based on graph learning.
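A minimal sketch of parsing one of these files, assuming the row layout written by `extract_embeddings.py` (shown further below): three comma-separated coordinates, the S/C/R region, then the embedding values.

```python
import numpy as np

def read_graph_file(graph_path):
    coords, regions, embeddings = [], [], []
    with open(graph_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            fields = line.strip().split(',')
            coords.append([float(v) for v in fields[:3]])      # CA coordinates
            regions.append(fields[3])                          # S / C / R region
            embeddings.append([float(v) for v in fields[4:]])  # embedding vector
    return np.array(coords), regions, np.array(embeddings)
```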
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Jan 2 20:21:06 2022
@author: yasser
"""
import logging
import os
import sys
from os import path, mkdir, getenv, listdir, remove, system, stat
import pandas as pd
import numpy as np
from prody import *
import glob
import shutil
#import matplotlib.pyplot as plt
import seaborn as sns
from math import exp
from subprocess import CalledProcessError, check_call, call
import traceback
from random import shuffle, random, seed, sample
from numpy import newaxis
import matplotlib.pyplot as plt
import time
from prody import *
import collections
import scr
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
import subprocess
import deepScoring.load_data as load
from sklearn.preprocessing import MinMaxScaler
logging.basicConfig(filename='manager.log', filemode='w', format='%(levelname)s: %(message)s', level=logging.DEBUG)
mainlog = logging.getLogger('main')
sys.path.insert(1, '../lib/')
import tools as tl
confom_dict = pd.read_csv('../Examples/conformations_list', sep=';')
comp_dir = '../Examples/conformations_directory'
target_comp = listdir(comp_dir)
map_dir = '../Examples/map_dir'
if not path.exists(map_dir):
mkdir(map_dir)
bin_path = "./mapsGenerator/build/maps_generator"
v_dim = 24
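# mapcomplex: for one docking pose, split the complex into receptor/ligand chains, compute the
# S/C/R regions with scr, run maps_generator to build the atomic density cubes, and save the
# result as an lz4-compressed pickle under map_dir/<pair>/<class>/.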
def mapcomplex(file, pose_class, ch1, ch2, pair, pose):
try:
name = pair+'_'+str(pose)
rec = parsePDB(file).select('protein').select('chain ' + ch1[0])
rec.setChids('R')
lig = parsePDB(file).select('protein').select('chain ' + ch2[0])
lig.setChids('L')
writePDB(name+'_r.pdb', rec.toAtomGroup())
writePDB(name+'_l.pdb', lig.toAtomGroup())
writePDB(name+'_complex.pdb', rec.toAtomGroup() + lig.toAtomGroup())
scr.get_scr(name+'_r.pdb', name+'_l.pdb', name+'_complex.pdb', name)
rimcoresup = pd.read_csv(name+'_rimcoresup.csv', header=None, sep=' ')
rec_regions = rimcoresup.loc[rimcoresup[4] == 'receptor']
rec_regions = pd.Series(rec_regions[5].values, index=rec_regions[2]).to_dict()
lig_regions = rimcoresup.loc[rimcoresup[4] == 'ligand']
lig_regions = pd.Series(lig_regions[5].values, index=lig_regions[2]).to_dict()
res_num2name_map_rec = dict(zip(rec.getResnums(),rec.getResnames()))
res_num2name_map_lig = dict(zip(lig.getResnums(),lig.getResnames()))
res_num2coord_map_rec = dict(zip(rec.select('ca').getResnums(),rec.select('ca').getCoords()))
res_num2coord_map_lig = dict(zip(lig.select('ca').getResnums(),lig.select('ca').getCoords()))
L1 = list(set(rec.getResnums()))
res_ind_map_rec = dict([(x,inx) for inx, x in enumerate(sorted(L1))])
L1 = list(set(lig.getResnums()))
res_ind_map_lig = dict([(x,inx+len(res_ind_map_rec)) for inx, x in enumerate(sorted(L1))])
res_inter_rec = [(res_ind_map_rec[x], rec_regions[x], x, 'R', res_num2name_map_rec[x], res_num2coord_map_rec[x])
for x in sorted(list(rec_regions.keys())) if x in res_ind_map_rec]
res_inter_lig = [(res_ind_map_lig[x], lig_regions[x], x, 'L', res_num2name_map_lig[x], res_num2coord_map_lig[x])
for x in sorted(list(lig_regions.keys())) if x in res_ind_map_lig]
reg_type = list(map(lambda x: x[1],res_inter_rec)) + list(map(lambda x: x[1],res_inter_lig))
res_name = list(map(lambda x: [x[4]],res_inter_rec)) + list(map(lambda x: [x[4]],res_inter_lig))
res_pos = list(map(lambda x: x[5],res_inter_rec)) + list(map(lambda x: x[5],res_inter_lig))
#Merge these two files!
with open('resinfo','w') as fh_res:
for x in res_inter_rec:
fh_res.write(str(x[2])+';'+x[3]+'\n')
for x in res_inter_lig:
fh_res.write(str(x[2])+';'+x[3]+'\n')
with open('scrinfo','w') as fh_csr:
for x in res_inter_rec:
fh_csr.write(str(x[2])+';'+x[3]+';'+x[1]+'\n')
for x in res_inter_lig:
fh_csr.write(str(x[2])+';'+x[3]+';'+x[1]+'\n')
if not res_inter_rec or not res_inter_lig:
return [],[],[]
#tl.coarse_grain_pdb('train.pdb')
mapcommand = [bin_path, "--mode", "map", "-i", name+'_complex.pdb', "--native", "-m", str(v_dim), "-t", "167", "-v", "0.8", "-o", name+'_complex.bin']
call(mapcommand)
dataset_train = load.read_data_set(name+'_complex.bin')
print(dataset_train.maps.shape)
#scaler = MinMaxScaler()
#scaler.fit(dataset_train.maps)
#data_norm = scaler.transform(dataset_train.maps)
data_norm = dataset_train.maps
X = np.reshape(data_norm, (-1,v_dim,v_dim,v_dim,173))
y = [int(pose_class)]*(len(res_inter_rec) + len(res_inter_lig))
map_name = path.join(map_dir, pair, pose_class, name)
tl.save_obj((X,y,reg_type,res_pos,res_name,res_inter_rec+res_inter_lig), map_name)
check_call(
[
'lz4', '-f',  # '--rm' omitted because of inconsistencies between lz4 versions
map_name + '.pkl'
],
stdout=sys.stdout)
remove(map_name + '.pkl')
remove(name+'_complex.bin')
remove(name+'_r.pdb')
remove(name+'_l.pdb')
remove(name+'_complex.pdb')
remove(name+'_rimcoresup.csv')
print(type(X))
print(X.shape)
except Exception as e:
logging.info("Bad interface!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
return [],[],[]
return X, y, reg_type
already_exist = listdir(map_dir)
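# process_targetcomplex: create the per-class output directories for a target complex and map all
# of its near-native ('1') and incorrect ('0') poses with mapcomplex; complexes already present in
# map_dir are skipped.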
def process_targetcomplex(targetcomplex, comp_dir, report_dict):
try:
if targetcomplex in already_exist:
return
logging.info('Processing ' + targetcomplex + ' ...')
pos_path = path.join(map_dir, targetcomplex, '1')
neg_path = path.join(map_dir, targetcomplex, '0')
if not path.exists(path.join(map_dir, targetcomplex)):
mkdir(path.join(map_dir, targetcomplex))
if not path.exists(pos_path):
mkdir(pos_path)
if not path.exists(neg_path):
mkdir(neg_path)
good_poses = confom_dict.loc[(confom_dict.Comp == targetcomplex) & (confom_dict.Class == 1)].Conf.to_list()
bad_poses = confom_dict.loc[(confom_dict.Comp == targetcomplex) & (confom_dict.Class == 0)].Conf.to_list()
ch1, ch2 = confom_dict.loc[confom_dict.Comp == targetcomplex][['ch1','ch2']].iloc[0]
for pose in good_poses:
file = path.join(comp_dir, pose + '.pdb')
mapcomplex(file, '1', ch1, ch2, targetcomplex, path.basename(pose))
for pose in bad_poses:
file = path.join(comp_dir, pose)
mapcomplex(file, '0', ch1, ch2, targetcomplex, path.basename(pose))
except Exception as e:
logging.info("Bad target complex!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
def manage_pair_files(use_multiprocessing):
tc_cases = []
for tc in target_comp:
tc_cases.append((tc, comp_dir))
report_dict = tl.do_processing(tc_cases, process_targetcomplex, use_multiprocessing)
return report_dict
report_dict = manage_pair_files(False)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 13 21:09:56 2020
@author: yasser
"""
import logging
import os
import sys
import gc
from os import path, mkdir, getenv, listdir, remove, system, stat
import pandas as pd
import numpy as np
#from prody import *
import glob
import shutil
#import matplotlib.pyplot as plt
import seaborn as sns
from math import exp
import subprocess
from subprocess import CalledProcessError, check_call
import traceback
from random import shuffle, random, seed, sample
from numpy import newaxis
import matplotlib.pyplot as plt
import time
import collections
#import scr
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras import regularizers
from tensorflow.keras.datasets import mnist # subroutines for fetching the MNIST dataset
from tensorflow.keras.models import Model, Sequential,load_model # basic class for specifying and training a neural network
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dense, Dropout, Activation, Flatten, AveragePooling3D
#from tensorflow.keras.utils import np_utils # utilities for one-hot encoding of ground truth values
from tensorflow.keras.layers import Dot
from tensorflow.keras.backend import ones, ones_like
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.preprocessing import MinMaxScaler
import pickle
print('Your python version: {}'.format(sys.version_info.major))
USE_TENSORFLOW_AS_BACKEND = True
# If you have an NVIDIA GPU on your computer, or run on Google Colab, keep FORCE_CPU set to False; otherwise set it to True to force CPU execution.
FORCE_CPU = False
if USE_TENSORFLOW_AS_BACKEND:
os.environ['KERAS_BACKEND'] = 'tensorflow'
else:
os.environ['KERAS_BACKEND'] = 'theano'
if FORCE_CPU:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""
if USE_TENSORFLOW_AS_BACKEND == True:
import tensorflow as tf
print('Your tensorflow version: {}'.format(tf.__version__))
print("GPU : "+tf.test.gpu_device_name())
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
else:
import theano
print('Your theano version: {}'.format(theano.__version__))
logging.basicConfig(filename='manager.log', filemode='w', format='%(levelname)s: %(message)s', level=logging.DEBUG)
mainlog = logging.getLogger('main')
seed(int(np.round(np.random.random()*10)))
#################################################################################################
def save_obj(obj, name):
with open(name + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
def load_obj(name):
with open(name + '.pkl', 'rb') as f:
return pickle.load(f)
v_dim = 24
map_dir = '../Examples/map_dir'
intermediate_dir = '../Examples/intermediate'
samples_test = listdir(map_dir)
if path.isdir(intermediate_dir):
shutil.rmtree(intermediate_dir)
mkdir(intermediate_dir)
model = load_model(path.join('../Examples', 'MODELS', 'Dockground', '0_model'))
# Must be updated!
def load_map(train_path):
"""
check_call(
[
'lz4_win64_v1_9_3\lz4.exe', '-d', '-f',
train_path
],
stdout=sys.stdout)
"""
print(train_path)
#X_train, y_train, reg_type, _,_,_ = load_obj(train_path.replace('.pkl.lz4',''))
X_train, y_train, reg_type, res_pos,_,_ = load_obj(train_path.replace('.pkl',''))
#remove(train_path.replace('.lz4',''))
return X_train, y_train, reg_type, res_pos
batch_samples_test_1 = []
batch_samples_test_0 = []
for pair in samples_test:
batch_samples_test_1 += glob.glob(path.join(map_dir,pair,'1','*'))
batch_samples_test_0 += glob.glob(path.join(map_dir,pair,'0','*'))
mkdir(path.join(intermediate_dir, pair))
mkdir(path.join(intermediate_dir, pair, '1'))
mkdir(path.join(intermediate_dir, pair, '0'))
batch_samples_test = batch_samples_test_1 + batch_samples_test_0
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
onehot = encoder.fit(np.asarray([['S'], ['C'], ['R']]))
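# The S/C/R region of each residue is one-hot encoded and passed to the network as an auxiliary
# input alongside the density cubes.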
for test_interface in batch_samples_test:
try:
print(test_interface)
X_test, y_test, reg_type, res_pos = load_map(test_interface)
X_aux = encoder.transform(list(map(lambda x: [x], reg_type)))
if len(X_test) == 0 or len(X_aux) != len(X_test):
continue
except Exception as e:
logging.info("Bad interface!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
continue
intermediate_model = Model(inputs=model.input, outputs=model.get_layer('layer1').output)
intermediate_prediction = intermediate_model.predict([X_test, X_aux], batch_size=X_test.shape[0])
_ = gc.collect()
with open(test_interface.replace(map_dir, intermediate_dir).replace('.pkl','.graph'),'w') as f_handler_graph:
for i in range(len(X_test)):
f_handler_graph.write(','.join(list(map(lambda x: str(x), res_pos[i]))) + ',' +
reg_type[i] + ',' +
','.join(list(map(lambda x: str(x), intermediate_prediction[i]))) + '\n')
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 13 21:09:56 2020
@author: yasser
"""
import logging
import os
import sys
import gc
from os import path, mkdir, getenv, listdir, remove, system, stat
import pandas as pd
import numpy as np
import glob
import seaborn as sns
from math import exp
from subprocess import CalledProcessError, check_call
import traceback
from random import shuffle, random, seed, sample
from numpy import newaxis
import matplotlib.pyplot as plt
import time
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Dot
from tensorflow.keras.backend import ones, ones_like
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.preprocessing import MinMaxScaler
import pickle
print('Your python version: {}'.format(sys.version_info.major))
USE_TENSORFLOW_AS_BACKEND = True
# If you have an NVIDIA GPU on your computer, or run on Google Colab, keep FORCE_CPU set to False; otherwise set it to True to force CPU execution.
FORCE_CPU = False
if USE_TENSORFLOW_AS_BACKEND:
os.environ['KERAS_BACKEND'] = 'tensorflow'
else:
os.environ['KERAS_BACKEND'] = 'theano'
if FORCE_CPU:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""
if USE_TENSORFLOW_AS_BACKEND == True:
import tensorflow as tf
print('Your tensorflow version: {}'.format(tf.__version__))
print("GPU : "+tf.test.gpu_device_name())
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
else:
import theano
print('Your theano version: {}'.format(theano.__version__))
logging.basicConfig(filename='manager.log', filemode='w', format='%(levelname)s: %(message)s', level=logging.DEBUG)
mainlog = logging.getLogger('main')
seed(int(np.round(np.random.random()*10)))
#################################################################################################
def save_obj(obj, name):
with open(name + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
def load_obj(name):
with open(name + '.pkl', 'rb') as f:
return pickle.load(f)
v_dim = 24
map_dir = '../Examples/map_dir'
samples_test = listdir(map_dir)
#model = load_model(path.join('../Examples', 'MODELS', 'Dockground', '0_model'))
model = load_model(path.join('../Examples', 'MODELS', 'BM5', '0_model'))
predictions_file = open('../Examples/predictions_SCR', 'w')
# Must be updated!
def load_map(train_path):
"""
check_call(
[
'lz4_win64_v1_9_3\lz4.exe', '-d', '-f',
train_path
],
stdout=sys.stdout)
"""
print(train_path)
#X_train, y_train, reg_type, _,_,_ = load_obj(train_path.replace('.pkl.lz4',''))
X_train, y_train, reg_type, res_pos,_,info = load_obj(train_path.replace('.pkl',''))
#X_train = X_train[:,:,:,:,:167]
"""
X_train_tmp = []
reg_type_tmp = []
y_train_tmp = []
for i in range(min(len(X_train),len(reg_type))):
if reg_type[i] == 'C' or reg_type[i] == 'R':
X_train_tmp.append(X_train[i])
reg_type_tmp.append(reg_type[i])
y_train_tmp.append(y_train[i])
reg_type = reg_type_tmp
X_train = np.array(X_train_tmp)
y_train = y_train_tmp
"""
#remove(train_path.replace('.lz4',''))
return X_train, y_train, reg_type, res_pos, info
fold = 'test'
batch_samples_test_1 = []
batch_samples_test_0 = []
for pair in samples_test:
batch_samples_test_1 += glob.glob(path.join(map_dir,pair,'1','*'))
batch_samples_test_0 += glob.glob(path.join(map_dir,pair,'0','*'))
batch_samples_test = batch_samples_test_1 + batch_samples_test_0
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
onehot = encoder.fit(np.asarray([['S'], ['C'], ['R']]))
predictions_file.write('Conf' + '\t' +
'Fold' + '\t' +
'Scores' + '\t' +
'Regions' + '\t' +
'Score' + '\t' +
'Time' + '\t' +
'Class' + '\t' +
'RecLig' + '\t' +
'ResNumber' + '\n')
for test_interface in batch_samples_test:
try:
print(test_interface)
X_test, y_test, reg_type, res_pos, info = load_map(test_interface)
X_aux = encoder.transform(list(map(lambda x: [x], reg_type)))
if len(X_test) == 0 or len(X_aux) != len(X_test):
continue
except Exception as e:
logging.info("Bad target complex!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
continue
start = time.time()
all_scores = model.predict([X_test, X_aux], batch_size=X_test.shape[0])
end = time.time()
_ = gc.collect()
test_preds = all_scores.mean()
print(y_test)
predictions_file.write(test_interface + '\t' +
str(fold) + '\t' +
','.join(list(map(lambda x: str(x[0]), all_scores))) + '\t' +
','.join(reg_type) + '\t' +
str(test_preds) + '\t' +
str(end-start) + '\t' +
str(y_test[0]) + '\t' +
','.join(list(map(lambda x: x[3], info))) + '\t' +
','.join(list(map(lambda x: str(x[2]), info))) + '\n')
predictions_file.close()