Commit cf8bc06b by DLA-Ranker

Updates

parent 08302f9a
name: null
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- bokeh=1.4
- cmake=3.16 # ensures that Gloo library extensions will be built
- cudnn=7.6
- cupti=10.1
- cxx-compiler=1.0 # ensures C and C++ compilers are available
- jupyterlab=1.2
- mpi4py=3.0 # installs cuda-aware openmpi
- nccl=2.5
- nodejs=13
- nvcc_linux-64=10.1 # configures environment to be "cuda-aware"
- pip=20.0
- pip:
- mxnet-cu101mkl==1.6.* # MXNET is installed prior to horovod
- -r file:requirements.txt
- python=3.7
- pytorch=1.4
- tensorboard=2.1
- tensorflow-gpu=2.1
- torchvision=0.5
Comp;ch1;ch2;Conf;Class
1AK4;A;B;1AK4_cm-it0_8585;0
1AK4;A;B;1AK4_cm-it0_9733;0
2I25;A;B;2I25_ti5-it1_46;1
# Auto detect text files and perform LF normalization
* text=auto
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
MIT License
Copyright (c) 2022 Simon Crouzet
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# InteractionGNN
Source code to run InteractionGNN on protein-protein interfaces
### Usage
1. Set the desired parameters in *config.ini* (parsed as shown in the sketch below)
2. In a bash shell, run `python interaction_gnn.py`
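A minimal sketch of how the *config.ini* values are consumed (it mirrors `read_config` in `src/config.py`; all values are read as strings and cast explicitly by `interaction_gnn.py`):
```
from configparser import ConfigParser

# Read the three sections used by interaction_gnn.py and cast a few values.
config = ConfigParser()
config.read('config.ini')
epochs = int(config['RUNPARAMS']['epochs'])             # e.g. "100" -> 100
use_kfold = config['RUNPARAMS']['use_kfold'] == 'True'  # booleans are stored as strings
data_dir = config['RUNINFO']['data_dir']
print(epochs, use_kfold, data_dir)
```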
### Requirements
##### Packages:
- Manual: install PyTorch, PyTorch Geometric, CUDA Toolkit, scikit-learn and the packages numpy, pandas, matplotlib, lz4 and tqdm (`conda install -c pytorch -c pyg -c conda-forge python=3.9 numpy pandas matplotlib tqdm pytorch pyg scikit-learn cuda-toolkit lz4`)
- All-in-one: Run `conda create --name interaction_gnn --file interaction_gnn.yml`
\
InteractionGNN uses [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric).
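A quick way to verify the installation (a minimal sketch, not part of the project: it only checks that the core packages import and whether a GPU is visible):
```
import torch
import torch_geometric
import sklearn

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torch_geometric', torch_geometric.__version__)
print('scikit-learn', sklearn.__version__)
```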
##### Data files:
Data files should be placed in the `data` folder, organized as in the following example for binary classification:
\
```
InteractionGNN
| interaction_gnn.py
|
|___src
| | ...
|
|___data
|___protein_pair_1
| |___0
| | | file1
| | | file2
| |
| |___1
| | file3
| | file4
|
|___protein_pair_2
| |___0
| | | file5
| | | file6
| |
| |___1
| | file7
| | file8
..........
```
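Each data file is a headerless CSV whose first three columns are x,y,z coordinates and whose remaining columns are node features (this is how `ProteinDataset.get()` in `src/data.py` reads it). The sketch below shows how such a file becomes a graph, assuming purely numeric features and an illustrative file path; the real loader additionally one-hot encodes the S/C/R region column:
```
import numpy as np
import pandas as pd
import torch
from torch_geometric.data import Data

# Load one sample: columns 0-2 are coordinates, the rest are node features.
df = pd.read_csv('data/protein_pair_1/0/file1', sep=',', header=None)
coords = df.iloc[:, :3].to_numpy(dtype=np.float32)
features = df.iloc[:, 3:].to_numpy(dtype=np.float32)

# Connect every pair of distinct nodes closer than the distance threshold.
threshold = 10.0  # default distance threshold in ProteinDataset
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
src, dst = np.nonzero((dist <= threshold) & ~np.eye(len(df), dtype=bool))

graph = Data(x=torch.tensor(features),
             edge_index=torch.tensor(np.stack([src, dst]), dtype=torch.long),
             y=0)  # class label of this sample (0 or 1)
print(graph)
```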
### Citing
If you use this code, please cite the associated paper:
\
```Y. Mohseni Behbahani, S. Crouzet, E. Laine, A. Carbone, *Deep Local Analysis evaluates protein docking conformations with locally oriented cubes*```
[RUNINFO]
save_models = True
data_dir = ./data/intermediate_m5_e30_nonorm_split1/
[MODELINFO]
nn_model_string = GCN
learning_rate = 0.001
nb_negative = -1
nb_positive = -1
nb_features = 22
specific_scr_fea = S,C,R
balance_test_split = True
exclude_last = True
[RUNPARAMS]
use_kfold = True
epochs = 100
batch_size = 500
nb_folds = 5
use_weights = True
import copy
import os
import numpy as np
import random
import torch
import matplotlib.pyplot as plt
import pandas as pd
import torch_geometric.data as geom_data
import torch_geometric.loader as geom_loader
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Linear, Dropout
from src.data import ProteinDataset, StratifiedSplit, kFoldStratifiedCrossValidation, reduce_split
from torch_geometric.data import Data, Dataset, Batch, DataLoader
from src.model import GCN, GAT
from src.utils import as_numpy, as_numpy_tuple, plot_roc_pr_curves
from src.metrics import ScoreDataframe
from datetime import datetime
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
from src.config import str_to_bool, str_to_model, read_config
from sklearn.utils.class_weight import compute_class_weight
print('================= InteractionGNN =================')
#Read config.ini file
runinfo, modelinfo, runparams = read_config('config.ini')
# parameters
start_time = datetime.now()
# Configuration, from the config.ini file
NN_model_string = modelinfo["NN_model_string"]
NN_model = str_to_model(NN_model_string) # GCN or GAT or Linear
save_models = str_to_bool(runinfo["save_models"])
epochs = int(runparams["epochs"])
batch_size = int(runparams["batch_size"])
nb_folds = int(runparams["nb_folds"])
model_no = runinfo["model_no"] # id for results
data_dir = runinfo["data_dir"] # data directory
nb_negative = int(modelinfo["nb_negative"]) # -1 means 'No limit for negative samples'
nb_positive = int(modelinfo["nb_positive"]) # -1 means 'No limit for positive samples'
use_kfold = str_to_bool(runparams["use_kfold"])
nb_features = int(modelinfo["nb_features"])
use_weights = str_to_bool(runparams["use_weights"])
# Specific SCR regions should be of format 'S,C,R' in the config file
specific_scr_fea = modelinfo['specific_scr_fea'].split(',')
balance_test_split = str_to_bool(modelinfo["balance_test_split"])
exclude_last = str_to_bool(modelinfo["exclude_last"])
data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), data_dir))
if not os.path.exists('results'):
os.mkdir('results')
print('Initializing dataset from {}'.format(data_dir))
seed = 123 # Seed to fix the simulation
random.seed(seed)
lr = float(modelinfo["learning_rate"]) # Learning rate for optimizer
# normalize train -> remember mean, std to use for val and test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Build dataset
dataset = ProteinDataset(data_dir, random_seed=seed, device=device, nb_features=nb_features, use_scr=True, specific_scr_fea=specific_scr_fea)
# Reduce dataset to the desired number of samples (values come from config.ini; -1 means no limit)
dataset.reduce(nb_negative=nb_negative, nb_positive=nb_positive)
print('Dataset built with {} samples - {} features and {} classes'.format(dataset.__len__(), dataset.nb_features, dataset.nb_classes))
if dataset.nb_classes == 2:
print('\tDataset contains {} P and {} N'.format(dataset.nb_samples[1], dataset.nb_samples[0]))
print('Loading Data...')
# TODO: Test parallelism to speed up data loading
# Creating sets
if use_kfold:
print('\tCreating train/val/test sets with k-fold cross validation...')
dataset_train, dataset_test = StratifiedSplit(dataset, first_partition=0.85, second_partition=0.15, shuffle=True)
dataset_val = []
train_len = (len(dataset_train)*(nb_folds -1)) / nb_folds
val_len = (len(dataset_train)) / nb_folds
test_len = len(dataset_test)
# Create folds
kFoldStrCV = kFoldStratifiedCrossValidation(dataset_train, nb_folds=nb_folds, nb_classes=2, shuffle=True)
iter_kFold = iter(kFoldStrCV)
range_fold = range(nb_folds)
else:
print('\tCreating train/val/test sets (without CV)...')
dataset_train, dataset_test_val = StratifiedSplit(dataset, first_partition=0.7, second_partition=0.3, shuffle=True)
dataset_test, dataset_val = StratifiedSplit(dataset_test_val, first_partition=0.5, second_partition=0.5, shuffle=True, in_memory=True, nb_classes=dataset.nb_classes)
del dataset_test_val # Unused heavy data
train_len = len(dataset_train)
test_len = len(dataset_test)
val_len = len(dataset_val)
range_fold = range(1)
assert(len(dataset_train)+len(dataset_val) > len(dataset_test)), 'Test set is surprisingly large!'
# Balancing test split
if balance_test_split:
print('\tBalancing test set...')
dataset_test = reduce_split(dataset_test, limits={0:100, 1:100}, nb_classes=dataset.nb_classes)
test_len = len(dataset_test)
dim = dataset_train[0].x.shape[1]
# Save pairs used in different sets
max_length = len(dataset_train)+len(dataset_val)
split_df = pd.DataFrame({'all_train':[d.pair for d in dataset_train+dataset_val],'test':[dataset_test[i].pair if i < len(dataset_test) else '' for i in range(max_length)]})
print('Splits created with {} train/{} val/{} test samples'.format(train_len, val_len, test_len))
# create model
# Set neural network architecture - can be swapped on the fly
print('Instantiating Graph Neural Network architecture...')
kwargs = {}
base_model = NN_model(**kwargs, dim=dim)
models = [copy.deepcopy(base_model).to(device) for _ in range_fold]
print('\tModel {} instantiated on device {}: {}'.format(NN_model, device, models[0]))
#model.init_weights(torch.nn.init.xavier_normal_, 0.1)
# main loop
hist = {"train_loss": np.array([0.0 for e in range(epochs)]), "val_loss": np.array([0.0 for e in range(epochs)]), "val_acc": np.array([0.0 for e in range(epochs)])}
scores = ScoreDataframe()
metrics_val = {'fprs':[], 'tprs':[], 'precisions':[], 'recalls':[], 'aucs':np.zeros(nb_folds)}
for k in range_fold:
print('=========')
print(f'Fold {k + 1}:')
print('=========')
print()
optimizer = torch.optim.Adam(models[k].parameters(), lr=lr, weight_decay=5e-4) # Define optimizer.
# Preparing test and val sets
if use_kfold:
train_set, val_set = iter_kFold.__next__()
else:
train_set, val_set = dataset_train, dataset_val
# Compute weights for each class. Since validation loss is only informative, use weights from train set only.
if use_weights:
weights = torch.Tensor(compute_class_weight('balanced', classes=[0,1], y=[d.y for d in train_set])).to(device)
else:
weights = torch.Tensor(compute_class_weight(None, classes=[0,1], y=[d.y for d in train_set])).to(device)
split_df.insert(len(split_df.columns), 'train_fold{}'.format(k), [train_set[i].pair if i < len(train_set) else '' for i in range(max_length)])
split_df.insert(len(split_df.columns), 'val_fold{}'.format(k), [val_set[i].pair if i < len(val_set) else '' for i in range(max_length)])
# Divide sets into batches
graph_train_loader = geom_loader.DataLoader(train_set, batch_size=batch_size, shuffle=True)
graph_val_loader = geom_loader.DataLoader(val_set, batch_size=batch_size, shuffle=True) # Additional loader if you want to change to a larger dataset
# Running model
for epoch in range(epochs):
print('---------')
print(f'epoch {epoch + 1}:')
train_loss = 0.0
val_loss = 0.0
val_acc = 0.0
outputs, label = np.empty((0,dataset.nb_classes), dtype=np.float32), np.empty((0), dtype=np.int64) # for ROC PR plots of val dataset
# Train
for step, data in enumerate(graph_train_loader):
# Moving data to GPU memory
data = data.to(device)
# Forward pass
train_pred = as_numpy_tuple(models[k].trainGraph(data, optimizer, device, weights=weights))
train_loss += train_pred.loss
#Move back data to CPU and flush GPU memory
if device.type != 'cpu':
torch.cuda.empty_cache()
# if save_models:
# scores.add_rows(conformation_names=data.pair, folds=[k for i in range(len(train_pred.gt))], scores=train_pred.pred[:, 1], time_of_predictions=[np.NaN for i in range(len(train_pred.gt))], targets=train_pred.gt, epoch=epoch)
# Validation
for step, data in enumerate(graph_val_loader):
# Moving data to GPU memory
data = data.to(device)
# Forward pass
val_pred = as_numpy_tuple(models[k].predictGraph(data, device))
val_loss += val_pred.loss
val_acc += val_pred.correct
if save_models:
scores.add_rows(conformation_names=data.pair, folds=[k for i in range(len(val_pred.gt))], scores=val_pred.pred[:, 1], time_of_predictions=[np.NaN for i in range(len(val_pred.gt))], targets=val_pred.gt, epoch=epoch)
# data for plot # TODO: Fix val plot for batch size > 1
outputs = np.concatenate((outputs, val_pred.pred), axis=0)
label = np.concatenate((label, val_pred.gt), axis=0)
# Move back data to CPU and flush GPU memory
if device.type != "cpu":
torch.cuda.empty_cache()
# Small bias here: losses are divided by the number of batches, so the last batch (if smaller than batch_size) carries more weight than the others.
train_loss /= len(graph_train_loader)
val_loss /= len(graph_val_loader)
val_acc = val_acc / len(val_set)
print('Training loss: ', train_loss)
print('Val loss: ', val_loss)
print('Val accuracy: %4.2f%%' % (100.0*float(val_acc)))
print('---------')
hist['train_loss'][epoch] += train_loss
hist['val_loss'][epoch] += val_loss
hist['val_acc'][epoch] += val_acc
# save model
# if save_models:
# torch.save(models[k], 'results/' + NN_model_string + '_' + model_no + '_model_epoch_' + str(epoch+1) + '.pt') # , PATH
if dataset.nb_classes == 2:
fpr, tpr, _ = roc_curve(y_true=label, y_score=outputs[:,1])
auc = roc_auc_score(y_true=label, y_score=outputs[:,1])
precision, recall, _ = precision_recall_curve(y_true=label, probas_pred=outputs[:,1]) # third return value is the thresholds
metrics_val['fprs'].append(fpr)
metrics_val['tprs'].append(tpr)
metrics_val['aucs'][k] = auc
metrics_val['precisions'].append(precision)
metrics_val['recalls'].append(recall)
plot_roc_pr_curves(metrics_val, 'results/roc_pr_val_curve_' + NN_model_string+ '_' + model_no + '.svg', plot_roc_lines=True)
if use_kfold:
hist['train_loss'] = np.divide(hist['train_loss'], nb_folds)
hist['val_loss'] = np.divide(hist['val_loss'], nb_folds)
hist['val_acc'] = np.divide(hist['val_acc'], nb_folds)
# test NN using test dataset
print('--------------------------')
print('Testing on test dataset...')
# Divide set into batches
graph_test_loader = geom_loader.DataLoader(dataset_test, batch_size=1)
test_loss = 0
test_acc = 0
metrics_test = {'base_fpr':np.linspace(0, 1, len(dataset_test)), 'tprs':np.zeros((nb_folds, len(dataset_test))), 'precisions':[], 'recalls':[], 'aucs':np.zeros(nb_folds)}
for k in range_fold:
outputs, label = np.empty((0,dataset.nb_classes), dtype=np.float32), np.empty((0), dtype=np.int64) # for ROC PR plots of val dataset
for step, data in enumerate(graph_test_loader):
# Moving data to GPU memory
data = data.to(device)
# Forward pass
test_pred = as_numpy_tuple(models[k].predictGraph(data, device))
test_loss += test_pred.loss
test_acc += test_pred.correct
outputs = np.concatenate((outputs, test_pred.pred), axis=0)
label = np.concatenate((label, test_pred.gt), axis=0)
if save_models:
scores.add_rows(conformation_names=data.pair, folds=[-1 for i in range(len(test_pred.gt))], scores=test_pred.pred[:, 1], time_of_predictions=[np.NaN for i in range(len(test_pred.gt))], targets=test_pred.gt, epoch=-1)
# Move back data to CPU and flush GPU memory
if device.type != "cpu":
torch.cuda.empty_cache()
if dataset.nb_classes == 2:
fpr, tpr, _ = roc_curve(y_true=label, y_score=outputs[:,1])
auc = roc_auc_score(y_true=label, y_score=outputs[:,1])
precision, recall, _ = precision_recall_curve(y_true=label, probas_pred=outputs[:,1]) # third return value is the thresholds
interp_tpr = np.interp(metrics_test['base_fpr'], fpr, tpr)
interp_tpr[0] = 0.0
metrics_test['tprs'][k] = interp_tpr
metrics_test['aucs'][k] = auc
metrics_test['precisions'].append(precision)
metrics_test['recalls'].append(recall)
plot_roc_pr_curves(metrics_test, 'results/roc_pr_test_curve_' + NN_model_string+ '_' + model_no + '.svg')
test_loss /= (len(graph_test_loader) * nb_folds)
test_acc /= (len(dataset_test) * nb_folds)
print('Test loss: ', test_loss)
print('Test accuracy: %4.2f%%' % (100.0*float(test_acc)))
print('--------------------------')
# plot figure
fig, ax = plt.subplots(1,1)
ax.plot([e for e in range(0, epochs)], hist["train_loss"], label="train_loss")
ax.plot([e for e in range(0, epochs)], hist["val_loss"], label="val_loss")
ax.plot([e for e in range(0, epochs)], hist["val_acc"], label="val_acc")
#ax.set_xticks(np.arange(0, epochs, 1.0))
#ax.set_xticklabels(np.arange(1, epochs+1, 1))
ax.set_xlabel("epoch")
ax.legend()
fig.savefig('results/' + NN_model_string + '_' + model_no + '.svg')
# info of run
split_df.to_csv('results/splits_' + NN_model_string+ '_' + model_no + '.csv', mode='a', header=True)
scores.export('results/prediction_results_' + NN_model_string + '_' + model_no +'.csv')
f= open("results/" + NN_model_string + "_" + model_no + ".txt", "w+")
f.write("Data from " + data_dir + "\n")
f.write("NN_model: " + NN_model_string + "\n")
f.write("model: " + str(models[0]) + "\n")
f.write("epochs: " + str(epochs) + "\n")
f.write("batch size: " + str(batch_size) + "\n")
end_time = datetime.now()
f.write("duration: " + str(end_time-start_time) + "\n")
f.write("size train dataset: " + str(train_len) + "\n")
f.write("size validation dataset: " + str(val_len) + "\n")
f.write("size test dataset: " + str(test_len) + "\n")
# class proportions
f.close()
# time
print('Duration: {}'.format(end_time - start_time))
name: interaction_gnn
channels:
- pytorch
- pyg
- nvidia
- bioconda
- conda-forge
- defaults
dependencies:
- blas=2.113=mkl
- blas-devel=3.9.0=13_win64_mkl
- brotli=1.0.9=h8ffe710_6
- brotli-bin=1.0.9=h8ffe710_6
- brotlipy=0.7.0=py39hb82d6ee_1003
- bzip2=1.0.8=h8ffe710_4
- ca-certificates=2021.10.8=h5b45459_0
- certifi=2021.10.8=py39hcbf5309_2
- cffi=1.15.0=py39h0878f49_0
- charset-normalizer=2.0.12=pyhd8ed1ab_0
- colorama=0.4.4=pyh9f0ad1d_0
- cryptography=36.0.2=py39h7bc7c5c_0
- cuda-cccl=11.6.55=hd268e57_0
- cuda-command-line-tools=11.6.2=h65bbf44_0
- cuda-compiler=11.6.2=h65bbf44_0
- cuda-cudart=11.6.55=h5fb1900_0
- cuda-cuobjdump=11.6.124=h8654613_0
- cuda-cupti=11.6.124=h532822a_0
- cuda-cuxxfilt=11.6.124=h3f9c74b_0
- cuda-libraries=11.6.2=h65bbf44_0
- cuda-libraries-dev=11.6.2=h65bbf44_0
- cuda-memcheck=11.6.124=hea6bc18_0
- cuda-nvcc=11.6.124=h769bc0d_0
- cuda-nvdisasm=11.6.124=he05ff55_0
- cuda-nvml-dev=11.6.55=h2bb381e_0
- cuda-nvprof=11.6.124=he581663_0
- cuda-nvprune=11.6.124=hb892de1_0
- cuda-nvrtc=11.6.124=h231bd66_0
- cuda-nvrtc-dev=11.6.124=hd7d06dc_0
- cuda-nvtx=11.6.124=hee9d5a4_0
- cuda-nvvp=11.6.124=h6a974fa_0
- cuda-sanitizer-api=11.6.124=ha4888a7_0
- cuda-toolkit=11.6.2=h65bbf44_0
- cuda-tools=11.6.2=h65bbf44_0
- cuda-visual-tools=11.6.2=h65bbf44_0
- cudatoolkit=11.5.0=hfde6d1b_9
- cycler=0.11.0=pyhd8ed1ab_0
- fonttools=4.25.0=pyhd3eb1b0_0
- freetype=2.10.4=h546665d_1
- icu=69.1=h0e60522_0
- idna=3.3=pyhd8ed1ab_0
- intel-openmp=2022.0.0=h57928b3_3663
- jbig=2.1=h8d14728_2003
- jinja2=3.1.1=pyhd8ed1ab_0
- joblib=1.1.0=pyhd8ed1ab_0
- jpeg=9e=h8ffe710_0
- kiwisolver=1.4.2=py39h2e07f2f_0
- lcms2=2.12=h2a16943_0
- lerc=3.0=h0e60522_0
- libblas=3.9.0=13_win64_mkl
- libbrotlicommon=1.0.9=h8ffe710_6
- libbrotlidec=1.0.9=h8ffe710_6
- libbrotlienc=1.0.9=h8ffe710_6
- libcblas=3.9.0=13_win64_mkl
- libclang=13.0.1=default_h81446c8_0
- libcublas=11.9.2.110=hb6c9e1a_0
- libcublas-dev=11.9.2.110=h1662081_0
- libcufft=10.7.2.124=hfa55d67_0
- libcufft-dev=10.7.2.124=hfb516c1_0
- libcurand=10.2.9.124=h99c9f72_0
- libcurand-dev=10.2.9.124=he731d31_0
- libcusolver=11.3.4.124=hc34ff3b_0
- libcusolver-dev=11.3.4.124=h4c0dfd9_0
- libcusparse=11.7.2.124=hfe5ea2b_0
- libcusparse-dev=11.7.2.124=h39db74a_0
- libdeflate=1.10=h8ffe710_0
- libffi=3.4.2=h8ffe710_5
- liblapack=3.9.0=13_win64_mkl
- liblapacke=3.9.0=13_win64_mkl
- libnpp=11.6.3.124=h516bb01_0
- libnpp-dev=11.6.3.124=hc355075_0
- libnvjpeg=11.6.2.124=hf97cc0b_0
- libnvjpeg-dev=11.6.2.124=h6c8d1d7_0
- libpng=1.6.37=h1d00b33_2
- libtiff=4.3.0=hc4061b1_3
- libuv=1.43.0=h8ffe710_0
- libwebp=1.2.2=h57928b3_0
- libwebp-base=1.2.2=h8ffe710_1
- libxcb=1.13=hcd874cb_1004
- libzlib=1.2.11=h8ffe710_1014
- lz4=3.1.10=py39h92e281b_0
- lz4-c=1.9.3=h8ffe710_1
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=2.1.1=py39hb82d6ee_1
- matplotlib=3.5.1=py39hcbf5309_0
- matplotlib-base=3.5.1=py39h581301d_0
- mkl=2022.0.0=h0e2418a_796
- mkl-devel=2022.0.0=h57928b3_797
- mkl-include=2022.0.0=h0e2418a_796
- msys2-conda-epoch=20160418=1
- munkres=1.0.7=py_1
- networkx=2.7.1=pyhd8ed1ab_1
- numpy=1.22.3=py39h6331f09_0
- openjpeg=2.4.0=hb211442_1
- openssl=1.1.1n=h8ffe710_0
- packaging=21.3=pyhd8ed1ab_0
- pandas=1.4.1=py39h2e25243_0
- pillow=9.0.1=py39ha53f419_2
- pip=22.0.4=pyhd8ed1ab_0
- pthread-stubs=0.4=hcd874cb_1001
- pycparser=2.21=pyhd8ed1ab_0
- pyg=2.0.4=py39_torch_1.11.0_cu115
- pyopenssl=22.0.0=pyhd8ed1ab_0
- pyparsing=3.0.7=pyhd8ed1ab_0
- pyqt=5.12.3=py39hb0d2dfa_4
- pysocks=1.7.1=py39hcbf5309_4
- python=3.9.12=h9a09f29_1_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-louvain=0.15=pyhd8ed1ab_1
- python_abi=3.9=2_cp39
- pytorch=1.11.0=py3.9_cuda11.5_cudnn8_0
- pytorch-cluster=1.6.0=py39_torch_1.11.0_cu115
- pytorch-mutex=1.0=cuda
- pytorch-scatter=2.0.9=py39_torch_1.11.0_cu115
- pytorch-sparse=0.6.13=py39_torch_1.11.0_cu115
- pytorch-spline-conv=1.2.1=py39_torch_1.11.0_cu115
- pytz=2022.1=pyhd8ed1ab_0
- pyyaml=6.0=py39hb82d6ee_4
- qt=5.12.9=h556501e_6
- requests=2.27.1=pyhd8ed1ab_0
- scikit-learn=1.0.2=py39he931e04_0
- scipy=1.8.0=py39hc0c34ad_1
- setuptools=61.3.0=py39hcbf5309_0
- six=1.16.0=pyh6c4a22f_0
- sqlite=3.37.1=h8ffe710_0
- tbb=2021.5.0=h2d74725_0
- threadpoolctl=3.1.0=pyh8a188c0_0
- tk=8.6.12=h8ffe710_0
- tornado=6.1=py39hb82d6ee_3
- tqdm=4.63.1=pyhd8ed1ab_0
- typing_extensions=4.1.1=pyha770c72_0
- tzdata=2022a=h191b570_0
- ucrt=10.0.20348.0=h57928b3_0
- urllib3=1.26.9=pyhd8ed1ab_0
- vc=14.2=hb210afc_6
- vs2015_runtime=14.29.30037=h902a5da_6
- wheel=0.37.1=pyhd8ed1ab_0
- win_inet_pton=1.1.0=py39hcbf5309_4
- xorg-libxau=1.0.9=hcd874cb_0
- xorg-libxdmcp=1.1.3=hcd874cb_0
- xz=5.2.5=h62dcd97_1
- yacs=0.1.8=pyhd8ed1ab_0
- yaml=0.2.5=h8ffe710_2
- zlib=1.2.11=h8ffe710_1014
- zstd=1.5.2=h6255e5f_0
- pip:
- pyqt5-sip==4.19.18
- pyqtchart==5.12
- pyqtwebengine==5.12.1
prefix: C:\ProgramData\Anaconda3\envs\interaction_gnn
from src.model import GAT, GCN, LinearNetwork
from configparser import ConfigParser, Error
from datetime import datetime
def read_config(path):
config_object = ConfigParser()
config_object.read(path)
# datetime object containing current date and time
now = datetime.now()
dt_string = now.strftime("%d_%m_%Y_%Hh%Mm%Ss")
model_no = "run_" + str(dt_string)
runinfo = config_object['RUNINFO']
modelinfo = config_object['MODELINFO']
runparams = config_object['RUNPARAMS']
runinfo['model_no'] = model_no
return runinfo, modelinfo, runparams
# functions to read config file
def str_to_bool(s):
if s == 'True':
return True
elif s == 'False':
return False
else:
raise ValueError('str_to_bool: {} is not a valid boolean string'.format(s))
def str_to_model(s):
if s == "GCN":
return GCN
elif s == "GAT":
return GAT
elif s == "Linear":
return LinearNetwork
else:
raise Error("Model error: {} not recognized as a model".format(s))
import os, functools, math, csv, random, sys
import json as json
import warnings
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import torch.utils.data
import torch
from torch_geometric.data import Data, Dataset, Batch, DataLoader
from src.utils import load_with_compression, euclidian_distance
import logging
from src.dataset_utils import *
#basic logging config
logging.basicConfig(filename="data.log",level=logging.DEBUG)
"""
Dataset class to handle files from protein interfaces.
Data are not loaded at construction time; each sample is loaded on demand via get(index).
Returns:
Dataset: A dataset containing protein data files
"""
class ProteinDataset(Dataset):
def __init__(self, data_dir, threshold=10.0, random_seed=123,
device=torch.device('cpu'), nb_features=None, nb_classes=2, exclude_last=False, use_scr=True, specific_scr_fea=None):
super().__init__()
# Set device
self.device = device
# Set data directory
assert os.path.isdir(data_dir), '{} does not exist!'.format(data_dir)
self.data_dir = data_dir
# Set number of features (from the sample) and number of classes
self.nb_features = nb_features
self.nb_classes = nb_classes
# Param to exclude last feature of samples - for example if the sample files already contain scores from another algorithm
self.exclude_last = exclude_last
# Set specific SCR regions to use - otherwise, use all
if specific_scr_fea is not None:
if type(specific_scr_fea) is not list:
specific_scr_fea = [specific_scr_fea]
for scr_fea in specific_scr_fea:
if scr_fea not in ['R', 'C', 'S']:
raise ValueError('{} is not a valid SCR feature - must be \'R\', \'C\' or \'S\''.format(scr_fea))
self.specific_scr_fea = specific_scr_fea
self.process_samples()
random.seed(random_seed)
self.dist_calc = euclidian_distance
self.threshold = threshold
self.use_scr = use_scr
def process_samples(self, samples_select=None):
# Process samples by splitting them by class
self.pairs = {c:[] for c in range(self.nb_classes)}
for p in os.listdir(self.data_dir):
if os.path.isdir(os.path.join(self.data_dir, p)):
for c in range(self.nb_classes):
if os.path.isdir(os.path.join(self.data_dir, p, str(c))):
# Store a tuple (protein pair, nb of samples) for each class
self.pairs[c].append((p,len(os.listdir(os.path.join(self.data_dir, p, '{}'.format(c))))))
# If samples were selected, keep only those
if samples_select is not None:
self.pairs = {c:[] for c in range(self.nb_classes)}
for s_idx in range(len(samples_select)):
for p,n in samples_select[s_idx].items():
# Store a tuple (protein pair, nb of samples) for each class
self.pairs[s_idx].append((p,n))
# Process final samples with name of files containing them
self.samples = {c:{} for c in range(self.nb_classes)}
self.nb_samples = {c:0 for c in range(self.nb_classes)}
self.samples_list = []
for c in self.pairs.keys():
for p,n in self.pairs[c]:
# Recreate name of files from protein pair and class
samples_p = [f for f in os.listdir(os.path.join(self.data_dir, p, '{}'.format(c)))]
# Set limit (selected samples, or all)
limit = min(n, len(samples_p))
for i in range(limit):
# Store dict of samples, per class and per protein pair
self.samples[c][p] = samples_p[i]
# Store count of sample per class
self.nb_samples[c] += 1
# Store list of samples with the filename, the class and the protein pair
self.samples_list.append((samples_p[i],c,p))
def reduce(self, nb_negative, nb_positive=-1):
# Reduce the dataset to a specific number of positive/negative samples
# Only for binary classification!
if self.nb_classes != 2:
raise NotImplementedError('ProteinDataset: reduce() is not implemented for non-binary classification')
if nb_negative == -1 and nb_positive == -1:
return
if nb_positive == -1 or nb_positive > self.nb_samples[1]:
if nb_positive > self.nb_samples[1]:
warnings.warn('Reducing: Not enough positive samples in the dataset')
nb_positive = self.nb_samples[1]
if nb_negative == -1 or nb_negative > self.nb_samples[0]:
if nb_negative > self.nb_samples[0]:
warnings.warn('Reducing: Not enough negative samples in the dataset')
nb_negative = self.nb_samples[0]
count_samples = {0:0, 1:0}
# Keep only the first nb_positive/nb_negative samples
pairs_neg_select = {p:0 for p,n in self.pairs[0]}
pairs_pos_select = {p:0 for p,n in self.pairs[1]}
# Select negative samples by protein pair
for p, nb_s in self.pairs[0]:
if count_samples[0] < nb_negative:
if count_samples[0] + nb_s <= nb_negative:
pairs_neg_select[p] = nb_s
else:
pairs_neg_select[p] = nb_negative - count_samples[0]
count_samples[0] += pairs_neg_select[p]
# Select positive samples by protein pair
for p, nb_s in self.pairs[1]:
if count_samples[1] < nb_positive:
if count_samples[1] + nb_s <= nb_positive:
pairs_pos_select[p] = nb_s
else:
pairs_pos_select[p] = nb_positive - count_samples[1]
count_samples[1] += pairs_pos_select[p]
# Process samples again with the new selection
self.process_samples(samples_select=(pairs_neg_select, pairs_pos_select))
def __len__(self):
return sum([v for v in self.nb_samples.values()])
def one_hot_rcs(self, df, col_idx):
# Encode the RCS features in one-hot encoding
if self.use_scr:
unique_fea = pd.array(['R', 'C', 'S'])
for u in unique_fea:
df[u] = df.apply(lambda row: int(row[col_idx] == u), axis=1)
# Remove the original column
df = df.drop(col_idx, axis=1)
return df
def __getitem__(self, idxs):
if type(idxs) is int:
# One sample requested
return self.get(idxs)
elif isinstance(idxs, (list, tuple)):
# Multiple samples requested
return [self.get(idx) for idx in idxs if self.get(idx) is not None]
else:
raise ValueError('Dataset.__getitem__() can\'t return items from indexes of type ', type(idxs))
def get_idx(self, idx):
# Get a sample from its index
return self.get(idx)
def get(self, idx):
# Compute sample from the stored information
# Extract filename, class and protein pair from the index
sample, label, pair = self.samples_list[idx]
path = os.path.join(self.data_dir, pair, str(label), sample)
# Load the sample from its file
df = pd.read_csv(path, sep=',',header=None, index_col=None)
# Drop NaN columns (if any)
df.dropna(axis=1, how='all', inplace=True)
# Analyze size of the dataset
nb_rows, nb_cols = df.shape[0], df.shape[1]
nb_features = nb_cols - 3
atoms_features = []
edges_from = []
edges_to = []
if nb_features != self.nb_features:
raise ValueError('Dataset.get(): Wrong number of features in the dataset (expected {}, got {})'.format(self.nb_features, nb_features))
if self.exclude_last:
# Exclude last feature - for example if the sample files already contain scores from another algorithm
df = df.drop(nb_cols-1, axis=1)
nb_cols -= 1
if df[3].dtype == np.dtype('object'):
# Dataset contains SCR regions - encode them
if self.specific_scr_fea:
# If specific RCS regions are requested, keep only those
df = df[df[3].isin(self.specific_scr_fea)].reset_index(drop=True)
nb_rows = df.shape[0]
df = self.one_hot_rcs(df=df, col_idx=3)
else:
raise ValueError('SCR features should be in index 3')
if len(df) == 0:
warnings.warn('Dataset.get(): Empty data for sample {}'.format(sample))
return None
for i in range(nb_rows):
# For each row (residue, atom or cubes), extract the features
atoms_features.append(df.loc[i][3:].values)
# Extract neighbours and build edges
for j in range(nb_rows):
if i != j:
distance = self.dist_calc(df.loc[i][:3].values, df.loc[j][:3].values)
if distance <= self.threshold:
edges_from.append(i)
edges_to.append(j)
protein_atom_fea = torch.Tensor(np.array(atoms_features)) # Atom features
if self.nb_features is not None:
if self.use_scr:
expected_features = self.nb_features + 2 # 1 RCS feature was encoded into a vector of length 3
else:
expected_features = self.nb_features - 1 # 1 RCS feature was dropped
# Check if the features shape is correct
if (not self.exclude_last and protein_atom_fea.shape[1] != expected_features) or (self.exclude_last and protein_atom_fea.shape[1] != expected_features - 1):
logging.info('Dataset: Sample {}_{} from pair {} did not possess the number of features expected'.format(sample, label, pair))
return None
protein_edges_idx = torch.LongTensor(np.array([edges_from, edges_to])) # Edge connectivity - shape [2, num_edges]
# Class Data from torch_geometric.data, with additional attributes 'protein pair' and 'sample filename'
return Data(x=protein_atom_fea, edge_index=protein_edges_idx, y=label, idx=idx, pair=pair, sample=os.path.splitext(sample)[0])
def StratifiedSplit(dataset, first_partition=1.0, second_partition=0.0, shuffle=False, seed=random.seed(), in_memory=False, nb_classes=2):
# Custom stratified split function for the dataset
dataset_len = len(dataset)
if first_partition + second_partition != 1.0:
raise ValueError('StratifiedSplit error: Sum of split percentages must be equal to 1.0.')
data_dict = {}
# Create "class => list of data" Dict
# => dataset.get(i) is calculating and loading the data, so it is computationally heavy!
for i in tqdm(range(dataset_len), desc='Splitting data...', disable=in_memory):
if not in_memory: # Data need to be loaded - computationally heavy!
data = dataset.get(i)
else: # Data were already loaded - light
data = dataset[i]
if data is not None:
if data.pair not in data_dict:
data_dict[data.pair] = {k:[] for k in range(nb_classes)}
data_dict[data.pair][data.y].append(data)
# Build splits
first_split, second_split = [], []
np.random.seed(seed)
acc_keys = [k for k in data_dict.keys()]
split_pcount = 0
split1_limit = len(acc_keys) * first_partition
if shuffle:
random.shuffle(acc_keys) # Shuffle keys
for p in acc_keys:
# Create first split
if split_pcount < round(split1_limit) or first_partition == 1.0:
for k in range(nb_classes):
for s in data_dict[p][k]:
first_split.append(s)
split_pcount += 1
# Create second split
else:
for k in range(nb_classes):
for s in data_dict[p][k]:
second_split.append(s)
del data_dict # To free the memory used
# Shuffle again to shuffle classes within the sets
if shuffle:
random.shuffle(first_split)
random.shuffle(second_split)
return first_split, second_split
def reduce_split(split, limits={0:100, 1:100}, nb_classes=2):
# Reduce a split to a specific number of samples - e.g. to have a balanced validation set
pruned_split = []
count_reduced = {k:0 for k in range(nb_classes)}
for data in split:
if count_reduced[data.y] < limits[data.y]:
pruned_split.append(data)
count_reduced[data.y] += 1
return pruned_split
class kFoldStratifiedCrossValidation:
def __init__(self, dataset, nb_folds=5, nb_classes=2, shuffle=False, seed=random.seed()):
# Custom stratified cross-validation function for the dataset
if len(dataset) < nb_folds*2:
raise ValueError('kFold error: please use kFold with enough data in the dataset.')
self.dataset = dataset
if shuffle:
random.shuffle(dataset)
self.nb_folds = nb_folds
self.nb_classes = nb_classes
self.dataset_len = len(dataset)
np.random.seed(seed)
self.shuffle = shuffle
# Decide if the validation set has to be balanced or not
self.folds = [[] for f in range(nb_folds)]
# Build folds
self.build_folds()
def build_folds(self):
data_dict = {}
# Create "class => list of data" Dict
for i in tqdm(range(self.dataset_len), desc='Constructing kFold...'):
data = self.dataset[i]
if data.pair not in data_dict:
data_dict[data.pair] = {k:[] for k in range(self.nb_classes)}
data_dict[data.pair][data.y].append(data)
acc_keys = [k for k in data_dict.keys()]
nb_samples = len(acc_keys)
if self.shuffle:
random.shuffle(acc_keys) # Shuffle keys
nb_samples_per_fold = nb_samples / self.nb_folds
if nb_samples < self.nb_folds:
raise ValueError('kFold error: please use kFold with enough data in the dataset.')
for i in range(nb_samples):
i_fold = int(i / nb_samples_per_fold)
if i_fold > self.nb_folds -1:
raise ValueError('Unexpected bug during StratifiedSplit initialization')
# Push sample into the fold
for t,s in data_dict[acc_keys[i]].items():
self.folds[i_fold].extend(s)
# Shuffle folds
for f in self.folds:
if self.shuffle:
random.shuffle(f)
if self.shuffle:
random.shuffle(self.folds)
del data_dict # To free the memory used
def __iter__(self):
# Iterate over folds
self.val_idx = -1
return self
def __next__(self): # return (train_set, val_set)
if self.val_idx >= self.nb_folds-1:
raise StopIteration
self.val_idx = (self.val_idx + 1) % self.nb_folds
# Create train and validation sets regarding the fold index
train_set, val_set = [], []
for i_fold in range(self.nb_folds):
if i_fold == self.val_idx:
val_set.extend(self.folds[i_fold])
else:
train_set.extend(self.folds[i_fold])
return train_set, val_set
import ast, os
def transform_Dockground_name(pair, sample):
if '\uf03a' in pair:
pair1, pair2 = pair.split('\uf03a')
pair2 = pair1.split('_')[0] + '_' + pair2
sample_id = sample.split('_')[2]
return pair1 + '--' + pair2, pair1 + '--' + pair2 + '_' + sample_id
elif len(pair.split('_')) == 3:
base, pair1, pair2 = pair.split('_')
pair1 = base + '_' + pair1
pair2 = base + '_' + pair2
sample_id = sample.split('_')[3]
return pair1 + '--' + pair2, pair1 + '--' + pair2 + '_' + sample_id
else:
raise ValueError('Dockground utils: problem with pair {}'.format(pair))
def select_Dockground_split(folder_path, dataset):
groups_txt_path = os.path.join(folder_path, 'groups.txt')
with open(groups_txt_path) as f:
data = f.read()
data = data.split('=')[1].replace(' ', '').strip().upper()
split_dict = ast.literal_eval(data)
splits = [[] for f in range(len(split_dict))]
for d in dataset:
pair1, pair2 = d.pair.split('--')
for key_fold, values in enumerate(split_dict.values()): # Use enumerate() to assure presence of correct keys (i.e. from 0 to len(split_dict) - 1)
for v in values:
if v in pair1.upper() and v in pair2.upper():
splits[key_fold].append(d)
elif v not in pair1.upper() and v not in pair2.upper():
continue
else:
raise ValueError('Dockground utils: problem with pair {}'.format(d.pair))
predefined_splits = [{'train':[], 'val':[]} for f in range(len(split_dict))]
for k in range(len(split_dict)):
predefined_splits[k]['val'] = splits[k]
for fold in range(len(split_dict)):
if fold != k:
predefined_splits[k]['train'] += splits[fold]
return predefined_splits
import numpy as np
import pandas as pd
class ScoreDataframe():
# Class to extract scores and relevant metrics from a dataframe
def __init__(self):
self.df = pd.DataFrame({'conformation_name': [], 'fold': [], 'empty_cell_1': [], 'empty_cell_2': [], 'score': [], 'time_of_prediction': [], 'target': [], 'epoch': []})
self.df = self.df.astype({'conformation_name': 'str', 'fold': 'int', 'score': 'float', 'time_of_prediction': 'float', 'target': 'int', 'epoch': 'int'})
def add_row(self, conformation_name, fold, score, time_of_prediction, target, epoch):
# Add one result to the dataframe
if score < 0.0 or score > 1.0:
raise ValueError('Score must be between 0 and 1')
self.df = pd.concat([self.df, pd.DataFrame([{'conformation_name': conformation_name, 'fold': fold, 'score': score, 'time_of_prediction': time_of_prediction, 'target': target, 'epoch': epoch}])], ignore_index=True)
def add_rows(self, conformation_names, folds, scores, time_of_predictions, targets, epoch):
# Add multiple results to the dataframe
if len(conformation_names) != len(folds) or len(conformation_names) != len(scores) or len(conformation_names) != len(time_of_predictions) or len(conformation_names) != len(targets):
raise ValueError('All lists must be of the same length')
news = {'conformation_name': conformation_names, 'fold': folds, 'score': scores, 'time_of_prediction': time_of_predictions, 'target': targets, 'epoch': [epoch for _ in range(len(conformation_names))]}
new_df = pd.DataFrame(news)
self.df = pd.concat([self.df, new_df], ignore_index=True)
def export(self, savepath):
# Export the dataframe to a csv file
self.df.to_csv(savepath, header=True)
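# Example usage (a sketch, mirroring how interaction_gnn.py records predictions):
#   scores = ScoreDataframe()
#   scores.add_rows(conformation_names=['1AK4_cm-it0_8585'], folds=[0], scores=[0.7],
#                   time_of_predictions=[float('nan')], targets=[0], epoch=1)
#   scores.export('results/prediction_results.csv')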
from copy import Error
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, Sequential, global_mean_pool, BatchNorm
from torch.nn import Linear, Dropout
from torch_geometric.data import Data
from sklearn.utils.class_weight import compute_class_weight
import collections
# NamedTuple for storing the output of the model
Prediction = collections.namedtuple('Prediction', ['loss', 'pred', 'gt', 'correct'])
class GCN(torch.nn.Module):
def __init__(self,
dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
super(GCN, self).__init__()
torch.manual_seed(12345)
# self.dim_atoms = kwargs.get('dim_atoms')
# self.dim_bonds = kwargs.get('dim_bonds')
# size of input/output node feature dimension, self-loops and symmetric normalization
self.in_dim = dim
self.loss_function = loss_function
self.final_activation = final_activation
self.define_architecture(dim)
def define_architecture(self, dim):
if dim > 16:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GCNConv(dim, 100), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=100), 'x -> x'),
(Dropout(p=0.5), 'x -> x'),
(Linear(100, 20), 'x -> x'),
nn.ELU(inplace=True),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
else:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GCNConv(dim, 20), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
def reinit_gcn(self, dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
# Use it to redefine a network without restarting the pipeline
return GCN(dim, activation_function=activation_function, final_activation=final_activation, loss_function=loss_function, **kwargs)
def forward(self, data):
x, edge_index, batch = data.x, data.edge_index, data.batch
return self.model(x, edge_index, batch)
# data of one batch as input of NN -> calculate loss
def trainGraph(self, batch, optimizer, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.train()
# Clear gradients
optimizer.zero_grad()
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights) # combines log_softmax and nll_loss
# Backward pass
loss.backward() # derive gradients
optimizer.step() # update weights
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
@torch.no_grad()
def predictGraph(self, batch, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.eval()
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights)
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
class LinearNetwork(torch.nn.Module):
def __init__(self,
dim,
architecture_type=None, # Use it for AutoGL
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
super(LinearNetwork, self).__init__()
torch.manual_seed(12345)
# self.dim_atoms = kwargs.get('dim_atoms')
# self.dim_bonds = kwargs.get('dim_bonds')
# size of input/output node feature dimension, self-loops and symmetric normalization
self.in_dim = dim
self.loss_function = loss_function
if dim > 21:
self.model = Sequential('x, edge_index, batch', [
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=dim), 'x -> x'),
(Linear(dim, 20), 'x -> x'),
(nn.ELU(inplace=True), 'x -> x'),
(BatchNorm(in_channels=20), 'x -> x'),
(Linear(20, 1), 'x -> x'),
(nn.Sigmoid(), 'x -> x'),
]
)
else:
self.model = Sequential('x, edge_index, batch', [
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=dim), 'x -> x'),
(Linear(dim, 1), 'x -> x'),
(nn.Sigmoid(), 'x -> x'),
]
)
def forward(self, data):
x, edge_index, batch = data.x, data.edge_index, data.batch
return self.model(x, edge_index, batch)
# data of one batch as input of NN -> calculate loss
def trainGraph(self, batch, optimizer, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.train()
# Clear gradients
optimizer.zero_grad()
# Forward pass
out = self(batch)
target = batch.y
if out.shape[1] != 1:
raise ValueError("Expected shape of probabilities should be (N, 1)")
out = out[:, 0]
# Calculate loss
loss = F.binary_cross_entropy(out, target.float(), reduction='mean', weight=weights)
# Backward pass
loss.backward() # derive gradients
optimizer.step() # update params based on gradients
# Get final prediction
prediction = [1 if p >= 0.5 else 0 for p in out]
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
@torch.no_grad()
def predictGraph(self, batch, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.eval()
# Forward pass
out = self(batch)
target = batch.y
if out.shape[1] != 1:
raise ValueError("Expected shape of probabilities should be (N, 1)")
out = out[:, 0]
# Calculate loss
loss = F.binary_cross_entropy(out, target.float(), reduction='mean', weight=weights)
# Get final prediction
prediction = [1 if p >= 0.5 else 0 for p in out]
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
class GAT(torch.nn.Module):
def __init__(self,
dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
super(GAT, self).__init__()
torch.manual_seed(12345)
self.in_dim = dim
self.loss_function = loss_function
self.final_activation = final_activation
self.define_architecture(dim)
def define_architecture(self, dim):
if dim > 16:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GATConv(dim, 100), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=100), 'x -> x'),
(Dropout(p=0.5), 'x -> x'),
(Linear(100, 20), 'x -> x'),
nn.ELU(inplace=True),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
else:
self.model = Sequential('x, edge_index, batch', [
(BatchNorm(in_channels=dim), 'x -> x'),
(GATConv(dim, 20), 'x, edge_index -> x'),
nn.ELU(inplace=True),
(global_mean_pool, 'x, batch -> x'),
(BatchNorm(in_channels=20), 'x -> x'),
(Dropout(p=0.2), 'x -> x'),
(Linear(20, 2), 'x -> x'),
]
)
def reinit_gat(self, dim,
activation_function=nn.ELU,
final_activation=nn.Softmax(dim=1),
loss_function=F.cross_entropy,
**kwargs):
# Use it to redefine a network without restarting the pipeline
return GAT(dim, activation_function=activation_function, final_activation=final_activation, loss_function=loss_function, **kwargs)
def forward(self, data):
x, edge_index, batch = data.x, data.edge_index, data.batch
return self.model(x, edge_index, batch)
def trainGraph(self, batch, optimizer, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.train()
# Clear gradients
optimizer.zero_grad() # clear gradients
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights)
# Backward pass
loss.backward() # derive gradients
optimizer.step() # update weights
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
@torch.no_grad()
def predictGraph(self, batch, device=torch.device('cuda'), weights=None):
# Move data to GPU if available
batch = batch.to(device)
self.eval()
# Forward pass
out = self(batch)
target = batch.y
# Calculate loss
loss = F.cross_entropy(out, target, reduction='mean', weight=weights)
# Get final prediction
out = self.final_activation(out) # Use Softmax to get probability
prediction = out.argmax(dim=1) # Use the class with highest probability.
# Calculate accuracy
correct = sum([1 if p == t else 0 for p,t in zip(prediction, target)])
return Prediction(loss, out, batch.y, correct)
import numpy as np
import torch
import json, os
import lz4.frame
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as skmetrics
from sklearn.metrics import RocCurveDisplay
import shelve
from src.model import Prediction
def as_numpy(values):
# Convert a Tensor or a list of Tensors to numpy structures
if type(values) is torch.Tensor:
return values.cpu().detach().numpy()
elif type(values) is list or type(values) is tuple:
output = [0 for i in range(len(values))]
for i in range(len(values)):
if type(values[i]) is torch.Tensor:
output[i] = values[i].cpu().detach().numpy()
else:
output[i] = values[i]
return output
else:
return values
def as_numpy_tuple(pred:Prediction):
# Pass Prediction named tuple from the model to Prediction named tuple made of numpy arrays
return Prediction(as_numpy(pred.loss), as_numpy(pred.pred), as_numpy(pred.gt), as_numpy(pred.correct))
def save_with_compression(obj, path, compression_level=lz4.frame.COMPRESSIONLEVEL_MINHC):
# Save a python object to a compressed file
with lz4.frame.open(path, mode='wb') as fp:
obj_bytes = json.dumps(obj).encode('utf-8')
compressed_obj = lz4.frame.compress(obj_bytes, compression_level=compression_level)
fp.write(compressed_obj)
def load_with_compression(path):
# Load a python object from a compressed file
with lz4.frame.open(path, mode='r') as fp:
output_compressed_data = fp.read()
obj_bytes = lz4.frame.decompress(output_compressed_data)
obj = json.loads(obj_bytes.decode('utf-8'))
return obj
def euclidian_distance(coords_from, coords_to):
# Compute the Euclidean distance between two points
if len(coords_from) != len(coords_to):
raise ValueError('Euclidian distance: coords don\'t have the same shape')
else:
return np.sqrt(np.array([np.power(i-j, 2) for i,j in zip(coords_from, coords_to)]).sum())
def plot_roc_pr_curves(metrics, savepath, plot_roc_lines=False):
# Plot meaningful ROC and PR curves
if not plot_roc_lines:
if len(np.unique(np.array([len(a) for a in metrics['tprs']]))) != 1:
raise ValueError("Error while plotting ROC curves")
# Plot ROC Curve...
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot ROC baseline
ax1.plot([0, 1], [0, 1], linestyle="--", lw=2, color="r", label="Chance", alpha=0.8)
if plot_roc_lines: # Plot each ROC curve (one per fold)
for fpr,tpr,auc in zip(metrics['fprs'], metrics['tprs'], metrics['aucs']):
ax1.plot(
fpr,
tpr,
label=r"Mean ROC (AUC = %0.2f)" % (auc),
lw=1,
alpha=0.8,
)
else:
# Plot mean ROC curve (one curve for all folds)
# Area between best and worst ROC curves are colored in grey
mean_tpr = np.mean(metrics['tprs'], axis=0)
mean_tpr[-1] = 1.0
mean_auc = skmetrics.auc(metrics['base_fpr'], mean_tpr)
std_auc = np.std(metrics['aucs'])
ax1.plot(
metrics['base_fpr'],
mean_tpr,
color="b",
label=r"Mean ROC (AUC = %0.2f $\pm$ %0.2f)" % (mean_auc, std_auc),
lw=2,
alpha=0.8,
)
std_tpr = np.std(metrics['tprs'], axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
ax1.fill_between(
metrics['base_fpr'],
tprs_lower,
tprs_upper,
color="grey",
alpha=0.2,
label=r"$\pm$ 1 std. dev.",
)
# Label, legend and set axe limits
ax1.set(
xlim=[-0.05, 1.05],
ylim=[-0.05, 1.05],
title="Receiver Operating Characteristic (ROC) Curve",
xlabel='False Positive Rate',
ylabel='True Positive Rate'
)
ax1.legend(loc="lower right")
# Plot PR Curve...
if len(metrics['precisions']) != len(metrics['recalls']):
raise ValueError("Error while computing prediction / recall")
else:
nb_folds = len(metrics['precisions'])
# Plot each PR curve (one per fold)
for p,r,i in zip(metrics['precisions'], metrics['recalls'], range(nb_folds)):
ax2.plot(
r,
p,
label="fold {}".format(i),
lw=1
)
# Label, legend and set axe limits
ax2.legend(loc="lower right")
ax2.set(
xlim=[-0.05, 1.05],
ylim=[-0.05, 1.05],
title="Precision Recall Curve",
xlabel='Recall',
ylabel='Precision'
)
# Save figure
plt.savefig(savepath)
plt.close()
def save_session(shelf_path):
# Save the current session
my_shelf = shelve.open(shelf_path,'n')
for key in globals().keys():
try:
my_shelf[key] = globals()[key]
except:
pass
my_shelf.close()
def load_session(shelf_path):
# Load a session
my_shelf = shelve.open(shelf_path)
for key in my_shelf:
try:
globals()[key]=my_shelf[key]
except:
print('Not loaded:', key)
pass
import sys, os
sys.path.insert(0, os.path.abspath(os.getcwd()))
import numpy as np
import torch
import matplotlib.pyplot as plt
import pandas as pd
import torch_geometric.data as geom_data
from src.data import ProteinDataset, StratifiedSplit, kFoldStratifiedCrossValidation, reduce_split
from torch_geometric.data import Data, Dataset, Batch, DataLoader
from src.model import GCN, GAT
from src.utils import as_numpy, as_numpy_tuple, plot_roc_pr_curves
from datetime import datetime
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_fscore_support, precision_recall_curve
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay
from configparser import ConfigParser
from src.config import str_to_bool, str_to_model, read_config
import unittest
class TestStratifiedSplit(unittest.TestCase):
def testIndependance(self):
data_dir = './data/intermediate_m5_e30_nonorm_split1/'
contains_pairs = True
skiptest = False
use_kfold = True
nb_folds = 5
seed = 123 # Seed to fix the simulation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Build dataset
dataset = ProteinDataset(data_dir, random_seed=seed, device=device, nb_features=22)
dataset.reduce(nb_negative=200, nb_positive=200)
dataset_train, dataset_test = StratifiedSplit(dataset, first_partition=0.85, second_partition=0.15, shuffle=True)
# dataset_test = reduce_split(dataset_test, nb_positive=100, nb_negative=100)
train_len = (len(dataset_train)*(nb_folds -1)) / nb_folds
val_len = (len(dataset_train)) / nb_folds
test_len = len(dataset_test)
print()
print('Training on {} proteins.'.format(train_len))
print('Validating on {} proteins.'.format(val_len))
print('Testing on {} proteins.'.format(test_len))
kFoldStrCV = kFoldStratifiedCrossValidation(dataset_train, nb_folds=nb_folds, nb_classes=2, shuffle=True)
iter_kFold = iter(kFoldStrCV)
range_fold = range(nb_folds)
# Set tests
all_train_pairs = [d.pair for d in dataset_train]
test_pairs = [d.pair for d in dataset_test]
for i_p in range(len(test_pairs)):
with self.subTest(msg='Pair {}'.format(test_pairs[i_p])):
self.assertNotIn(test_pairs[i_p], all_train_pairs)
for k in range_fold:
train_set, val_set = iter_kFold.__next__()
train_pairs = [d.pair for d in train_set]
val_pairs = [d.pair for d in val_set]
for i_p in range(len(val_pairs)):
with self.subTest(msg='Fold {} - Pair {}'.format(k, val_pairs[i_p])):
self.assertNotIn(val_pairs[i_p], train_pairs)
if __name__ == '__main__':
unittest.main()
![](Images/method5.svg.png?raw=true "DLA-Ranker")
### Contents
- [Overview](#overview)
- [Requirements](#Requirements)
- [Quick Tutorial](#Tutorial)
- [Documentation](https://deeprank.readthedocs.io/)
- [License](./LICENSE)
- [Issues & Contributing](#Issues-and-Contributing)
## Overview
![](Images/method5.svg.png?raw=true "DLA-Ranker")
Deep Local Analysis (DLA)-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface. It explicitly considers the local geometry of
the interfacial residues along with their neighboring atoms and the regions of the interface with different solvent accessibility. DLA-Ranker identifies near-native conformations and discovers alternative interfaces from ensembles generated by molecular docking.
#### Features:
- Useful APIs for fast preprocessing of large ensembles of complex conformations and their classification based on CAPRI criteria.
- Representation of an interface as a set of locally oriented cubes.
- *Atomic density map as a 3D grid.*
- *Structure class based on solvent accessibility (Support, Core, Rim).*
- *Information on the receptor and the ligand.*
- *Information on the interfacial residues.*
- Classification of docking conformations based on CAPRI criteria (Incorrect, Acceptable, Medium, High quality).
- Fast generation of cubes and evaluation of interfaces.
- Training and testing of 3D-CNN models.
- Support for various score aggregation schemes.
- *Considering only a subset of cubes for the evaluation of an interface.*
- *Residues from Support or Core regions.*
- *Residues from Core or Rim regions.*
- *Selecting residues exclusively from the receptor or from the ligand.*
- Extraction of embeddings and the topology of the interface for graph representation learning.
## Requirements
#### Packages:
DLA-Ranker can be run on Linux, macOS, and Windows. We recommend running DLA-Ranker on a machine with a GPU. It requires the following packages:
- Python version 3.7 or 3.8.
- TensorFlow version 2.2 or 2.3.
- CUDA Toolkit.
- Scikit-Learn, NumPy, pandas, Matplotlib, lz4, and tqdm (`conda install -c pytorch -c pyg -c conda-forge python=3.9 numpy pandas matplotlib tqdm pytorch pyg scikit-learn cuda-toolkit lz4`).
- All-in-one: run `conda create --name dla-ranker --file dla-ranker.yml`.
- For the requirements of InteractionGNN, please see its own README.
## Tutorial
DLA-Ranker works in two steps:
- Generating a set of locally oriented cubes representing the interface.
- Running the deep learning framework to:
- *Train: creating a new model.*
- *Test: Evaluating conformations using trained models.*
- *Encode: Extracting embeddings and the topology of the interface.*
### Generating locally oriented cubes
#### Dataset of conformations:
Place the ensemble of conformations in a directory (*e.g. 'Examples/conformations_directory'*) as shown below:
```
Example
|___conformations_list
|
|___conformations_directory
|
|___target complex 1
| | Conformation 1
| | Conformation 2
| | ...
|
|___target complex 2
| | Conformation 1
| | Conformation 2
| | ...
|
..........
```
'conformations_list' is a csv file with five columns separated by ';': the name of the target complex (`Comp`); the receptor chain ID(s) (`ch1`) and the ligand chain ID(s) (`ch2`); the name of the conformation file (`Conf`); and the class of the conformation (`Class`, 0: incorrect, 1: near-native).
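As a quick sanity check, the list can be loaded with pandas the same way the provided scripts do. A minimal sketch (adjust the path to wherever your 'conformations_list' lives):

```python
import pandas as pd

# The conformations list uses ';' as separator
conformations = pd.read_csv('Examples/conformations_list', sep=';')

# The five columns described above
expected = {'Comp', 'ch1', 'ch2', 'Conf', 'Class'}
assert expected.issubset(conformations.columns), 'conformations_list is missing required columns'
assert conformations['Class'].isin([0, 1]).all(), 'Class must be 0 (incorrect) or 1 (near-native)'

# Number of incorrect/near-native conformations per target complex
print(conformations.groupby('Comp')['Class'].value_counts())
```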
#### Processing the conformations
From directory 'Representation' run: ```python generate_cubes.py```
The output will be a directory 'map_dir' with the following structure:
```
Example
|___map_dir
|___target complex 1
| |___0
| | | conformation 1
| | | conformation 2
| |
| |___1
| | conformation 3
| | conformation 4
|
|___target complex 2
| |___0
| | | conformation 1
| | | conformation 2
| |
| |___1
| | conformation 3
| | conformation 4
..........
```
Each output represents the interface of a conformation and contains a set of local environments (*e.g. atomic density maps, structure classes (S, C, R), the topology of the interface, ...*).
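The per-conformation outputs are lz4-compressed pickles. Assuming the `lz4` command-line tool is on the PATH and that `tl.save_obj` pickles the tuple the same way as the `save_obj` helpers in the other scripts (both assumptions, based on `generate_cubes.py` shown further below), a minimal sketch of reading one interface back:

```python
import pickle
import subprocess

def load_interface(lz4_path):
    # Decompress '<name>.pkl.lz4' in place (produces '<name>.pkl' next to it)
    subprocess.check_call(['lz4', '-d', '-f', lz4_path])
    pkl_path = lz4_path.replace('.pkl.lz4', '.pkl')
    with open(pkl_path, 'rb') as f:
        # Tuple layout saved by generate_cubes.py:
        # X        - density cubes, one per interfacial residue
        # y        - per-cube class labels (0 or 1)
        # reg_type - S/C/R region of each residue
        # res_pos  - CA coordinates of each residue
        # res_name - residue names
        # res_info - (index, region, residue number, partner, name, coords) tuples
        X, y, reg_type, res_pos, res_name, res_info = pickle.load(f)
    return X, y, reg_type, res_pos, res_name, res_info
```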
### Deep learning framework
The following commands use the trained models that can be found in the directory 'Models'. This directory includes 3 sets of models:
- 'BM5': 10 models generated following a 10-fold cross-validation procedure on the 142 dimers of the Docking Benchmark version 5. The docking conformations were generated by HADDOCK.
- 'Dockground': 4 models generated following a 4-fold cross-validation procedure on the 59 target complexes of the Dockground database. The docking conformations were generated by GRAMM.
- 'CCD4PPI': 5 models generated following a 5-fold cross-validation procedure on the 400 target complexes.

For detailed information, please read the article.
#### Evaluation of interfaces
From directory 'Test' run ```python test.py```
It processes all the target complexes and their conformations and produces the csv file 'predictions_SCR'. Each row of the output file corresponds to a conformation and has 9 tab-separated columns:
- Name of the target complex and the conformation (`Conf`)
- Fold ID (`Fold`)
- Score of each residue (`Scores`)
- Region (SCR) of each residue (`Regions`)
- Global averaged score of the interface (`Score`)
- Processing time (`Time`)
- Class of the conformation (`Class`, 0: incorrect, 1: near-native)
- Partner (`RecLig`)
- Residue number (`ResNumber`, according to the PDB)

One can combine the residue numbers, regions, scores, and partner information to evaluate the interface on a subset of the interfacial residues, as sketched below.
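For instance, a minimal sketch of re-aggregating the per-residue scores with pandas, keeping only Core and Rim residues (the column names follow the list above; the file path and the choice of regions are illustrative):

```python
import pandas as pd

# Adjust the path to wherever test.py wrote 'predictions_SCR'
preds = pd.read_csv('Examples/predictions_SCR', sep='\t')

def region_average(row, keep=('C', 'R')):
    # 'Scores' and 'Regions' hold comma-separated per-residue values
    scores = [float(s) for s in str(row['Scores']).split(',')]
    regions = str(row['Regions']).split(',')
    kept = [s for s, r in zip(scores, regions) if r in keep]
    return sum(kept) / len(kept) if kept else float('nan')

# Interface score averaged over Core/Rim residues only
preds['Score_CR'] = preds.apply(region_average, axis=1)
print(preds[['Conf', 'Score', 'Score_CR', 'Class']].head())
```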
#### Extraction of the embeddings
From directory 'Test' run ```python extract_embeddings.py```
It extracts the embeddings and the topology of each given interface and writes them to an output file with the same name. Each row in a file corresponds to a residue and includes its coordinates, its region, and its embedding vector. These files can be used for aggregation of the embeddings based on graph learning.
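A minimal sketch of parsing one of these files, assuming the row layout written by `extract_embeddings.py` (shown further below): three comma-separated coordinates, the S/C/R region, then the embedding values.

```python
import numpy as np

def read_graph_file(graph_path):
    coords, regions, embeddings = [], [], []
    with open(graph_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            fields = line.strip().split(',')
            coords.append([float(v) for v in fields[:3]])      # CA coordinates
            regions.append(fields[3])                          # S / C / R region
            embeddings.append([float(v) for v in fields[4:]])  # embedding vector
    return np.array(coords), regions, np.array(embeddings)
```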
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Jan 2 20:21:06 2022
@author: yasser
"""
import logging
import os
import sys
from os import path, mkdir, getenv, listdir, remove, system, stat
import pandas as pd
import numpy as np
from prody import *
import glob
import shutil
#import matplotlib.pyplot as plt
import seaborn as sns
from math import exp
from subprocess import CalledProcessError, check_call, call
import traceback
from random import shuffle, random, seed, sample
from numpy import newaxis
import matplotlib.pyplot as plt
import time
from prody import *
import collections
import scr
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
import subprocess
import deepScoring.load_data as load
from sklearn.preprocessing import MinMaxScaler
logging.basicConfig(filename='manager.log', filemode='w', format='%(levelname)s: %(message)s', level=logging.DEBUG)
mainlog = logging.getLogger('main')
sys.path.insert(1, '../lib/')
import tools as tl
confom_dict = pd.read_csv('../Examples/conformations_list', sep=';')
comp_dir = '../Examples/conformations_directory'
target_comp = listdir(comp_dir)
map_dir = '../Examples/map_dir'
if not path.exists(map_dir):
mkdir(map_dir)
bin_path = "./mapsGenerator/build/maps_generator"
v_dim = 24
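# mapcomplex: for one docking pose, split the complex into receptor/ligand chains, compute the
# S/C/R regions with scr, run maps_generator to build the atomic density cubes, and save the
# result as an lz4-compressed pickle under map_dir/<pair>/<class>/.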
def mapcomplex(file, pose_class, ch1, ch2, pair, pose):
try:
name = pair+'_'+str(pose)
rec = parsePDB(file).select('protein').select('chain ' + ch1[0])
rec.setChids('R')
lig = parsePDB(file).select('protein').select('chain ' + ch2[0])
lig.setChids('L')
writePDB(name+'_r.pdb', rec.toAtomGroup())
writePDB(name+'_l.pdb', lig.toAtomGroup())
writePDB(name+'_complex.pdb', rec.toAtomGroup() + lig.toAtomGroup())
scr.get_scr(name+'_r.pdb', name+'_l.pdb', name+'_complex.pdb', name)
rimcoresup = pd.read_csv(name+'_rimcoresup.csv', header=None, sep=' ')
rec_regions = rimcoresup.loc[rimcoresup[4] == 'receptor']
rec_regions = pd.Series(rec_regions[5].values, index=rec_regions[2]).to_dict()
lig_regions = rimcoresup.loc[rimcoresup[4] == 'ligand']
lig_regions = pd.Series(lig_regions[5].values, index=lig_regions[2]).to_dict()
res_num2name_map_rec = dict(zip(rec.getResnums(),rec.getResnames()))
res_num2name_map_lig = dict(zip(lig.getResnums(),lig.getResnames()))
res_num2coord_map_rec = dict(zip(rec.select('ca').getResnums(),rec.select('ca').getCoords()))
res_num2coord_map_lig = dict(zip(lig.select('ca').getResnums(),lig.select('ca').getCoords()))
L1 = list(set(rec.getResnums()))
res_ind_map_rec = dict([(x,inx) for inx, x in enumerate(sorted(L1))])
L1 = list(set(lig.getResnums()))
res_ind_map_lig = dict([(x,inx+len(res_ind_map_rec)) for inx, x in enumerate(sorted(L1))])
res_inter_rec = [(res_ind_map_rec[x], rec_regions[x], x, 'R', res_num2name_map_rec[x], res_num2coord_map_rec[x])
for x in sorted(list(rec_regions.keys())) if x in res_ind_map_rec]
res_inter_lig = [(res_ind_map_lig[x], lig_regions[x], x, 'L', res_num2name_map_lig[x], res_num2coord_map_lig[x])
for x in sorted(list(lig_regions.keys())) if x in res_ind_map_lig]
reg_type = list(map(lambda x: x[1],res_inter_rec)) + list(map(lambda x: x[1],res_inter_lig))
res_name = list(map(lambda x: [x[4]],res_inter_rec)) + list(map(lambda x: [x[4]],res_inter_lig))
res_pos = list(map(lambda x: x[5],res_inter_rec)) + list(map(lambda x: x[5],res_inter_lig))
#Merge these two files!
with open('resinfo','w') as fh_res:
for x in res_inter_rec:
fh_res.write(str(x[2])+';'+x[3]+'\n')
for x in res_inter_lig:
fh_res.write(str(x[2])+';'+x[3]+'\n')
with open('scrinfo','w') as fh_csr:
for x in res_inter_rec:
fh_csr.write(str(x[2])+';'+x[3]+';'+x[1]+'\n')
for x in res_inter_lig:
fh_csr.write(str(x[2])+';'+x[3]+';'+x[1]+'\n')
if not res_inter_rec or not res_inter_lig:
return [],[],[]
#tl.coarse_grain_pdb('train.pdb')
mapcommand = [bin_path, "--mode", "map", "-i", name+'_complex.pdb', "--native", "-m", str(v_dim), "-t", "167", "-v", "0.8", "-o", name+'_complex.bin']
call(mapcommand)
dataset_train = load.read_data_set(name+'_complex.bin')
print(dataset_train.maps.shape)
#scaler = MinMaxScaler()
#scaler.fit(dataset_train.maps)
#data_norm = scaler.transform(dataset_train.maps)
data_norm = dataset_train.maps
X = np.reshape(data_norm, (-1,v_dim,v_dim,v_dim,173))
y = [int(pose_class)]*(len(res_inter_rec) + len(res_inter_lig))
map_name = path.join(map_dir, pair, pose_class, name)
tl.save_obj((X,y,reg_type,res_pos,res_name,res_inter_rec+res_inter_lig), map_name)
check_call(
[
'lz4', '-f',  # '--rm' omitted because of inconsistencies between lz4 versions
map_name + '.pkl'
],
stdout=sys.stdout)
remove(map_name + '.pkl')
remove(name+'_complex.bin')
remove(name+'_r.pdb')
remove(name+'_l.pdb')
remove(name+'_complex.pdb')
remove(name+'_rimcoresup.csv')
print(type(X))
print(X.shape)
except Exception as e:
logging.info("Bad interface!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
return [],[],[]
return X, y, reg_type
already_exist = listdir(map_dir)
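# process_targetcomplex: create the per-class output directories for a target complex and map all
# of its near-native ('1') and incorrect ('0') poses with mapcomplex; complexes already present in
# map_dir are skipped.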
def process_targetcomplex(targetcomplex, comp_dir, report_dict):
try:
if targetcomplex in already_exist:
return
logging.info('Processing ' + targetcomplex + ' ...')
pos_path = path.join(map_dir, targetcomplex, '1')
neg_path = path.join(map_dir, targetcomplex, '0')
if not path.exists(path.join(map_dir, targetcomplex)):
mkdir(path.join(map_dir, targetcomplex))
if not path.exists(pos_path):
mkdir(pos_path)
if not path.exists(neg_path):
mkdir(neg_path)
good_poses = confom_dict.loc[(confom_dict.Comp == targetcomplex) & (confom_dict.Class == 1)].Conf.to_list()
bad_poses = confom_dict.loc[(confom_dict.Comp == targetcomplex) & (confom_dict.Class == 0)].Conf.to_list()
ch1, ch2 = confom_dict.loc[confom_dict.Comp == targetcomplex][['ch1','ch2']].iloc[0]
for pose in good_poses:
file = path.join(comp_dir, pose + '.pdb')
mapcomplex(file, '1', ch1, ch2, targetcomplex, path.basename(pose))
for pose in bad_poses:
file = path.join(comp_dir, pose)
mapcomplex(file, '0', ch1, ch2, targetcomplex, path.basename(pose))
except Exception as e:
logging.info("Bad target complex!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
def manage_pair_files(use_multiprocessing):
tc_cases = []
for tc in target_comp:
tc_cases.append((tc, comp_dir))
report_dict = tl.do_processing(tc_cases, process_targetcomplex, use_multiprocessing)
return report_dict
report_dict = manage_pair_files(False)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 13 21:09:56 2020
@author: yasser
"""
import logging
import os
import sys
import gc
from os import path, mkdir, getenv, listdir, remove, system, stat
import pandas as pd
import numpy as np
#from prody import *
import glob
import shutil
#import matplotlib.pyplot as plt
import seaborn as sns
from math import exp
import subprocess
from subprocess import CalledProcessError, check_call
import traceback
from random import shuffle, random, seed, sample
from numpy import newaxis
import matplotlib.pyplot as plt
import time
import collections
#import scr
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras import regularizers
from tensorflow.keras.datasets import mnist # subroutines for fetching the MNIST dataset
from tensorflow.keras.models import Model, Sequential,load_model # basic class for specifying and training a neural network
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dense, Dropout, Activation, Flatten, AveragePooling3D
#from tensorflow.keras.utils import np_utils # utilities for one-hot encoding of ground truth values
from tensorflow.keras.layers import Dot
from tensorflow.keras.backend import ones, ones_like
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.preprocessing import MinMaxScaler
import pickle
print('Your python version: {}'.format(sys.version_info.major))
USE_TENSORFLOW_AS_BACKEND = True
# If you have an NVIDIA GPU on your computer, or run on Google Colab, keep FORCE_CPU set to False; otherwise set it to True to force CPU execution.
FORCE_CPU = False
if USE_TENSORFLOW_AS_BACKEND:
os.environ['KERAS_BACKEND'] = 'tensorflow'
else:
os.environ['KERAS_BACKEND'] = 'theano'
if FORCE_CPU:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""
if USE_TENSORFLOW_AS_BACKEND == True:
import tensorflow as tf
print('Your tensorflow version: {}'.format(tf.__version__))
print("GPU : "+tf.test.gpu_device_name())
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
else:
import theano
print('Your theano version: {}'.format(theano.__version__))
logging.basicConfig(filename='manager.log', filemode='w', format='%(levelname)s: %(message)s', level=logging.DEBUG)
mainlog = logging.getLogger('main')
seed(int(np.round(np.random.random()*10)))
#################################################################################################
def save_obj(obj, name):
with open(name + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
def load_obj(name):
with open(name + '.pkl', 'rb') as f:
return pickle.load(f)
v_dim = 24
map_dir = '../Examples/map_dir'
intermediate_dir = '../Examples/intermediate'
samples_test = listdir(map_dir)
if path.isdir(intermediate_dir):
shutil.rmtree(intermediate_dir)
mkdir(intermediate_dir)
model = load_model(path.join('../Examples', 'MODELS', 'Dockground', '0_model'))
# Must be updated!
def load_map(train_path):
"""
check_call(
[
'lz4_win64_v1_9_3\lz4.exe', '-d', '-f',
train_path
],
stdout=sys.stdout)
"""
print(train_path)
#X_train, y_train, reg_type, _,_,_ = load_obj(train_path.replace('.pkl.lz4',''))
X_train, y_train, reg_type, res_pos,_,_ = load_obj(train_path.replace('.pkl',''))
#remove(train_path.replace('.lz4',''))
return X_train, y_train, reg_type, res_pos
batch_samples_test_1 = []
batch_samples_test_0 = []
for pair in samples_test:
batch_samples_test_1 += glob.glob(path.join(map_dir,pair,'1','*'))
batch_samples_test_0 += glob.glob(path.join(map_dir,pair,'0','*'))
mkdir(path.join(intermediate_dir, pair))
mkdir(path.join(intermediate_dir, pair, '1'))
mkdir(path.join(intermediate_dir, pair, '0'))
batch_samples_test = batch_samples_test_1 + batch_samples_test_0
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
onehot = encoder.fit(np.asarray([['S'], ['C'], ['R']]))
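# The S/C/R region of each residue is one-hot encoded and passed to the network as an auxiliary
# input alongside the density cubes.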
for test_interface in batch_samples_test:
try:
print(test_interface)
X_test, y_test, reg_type, res_pos = load_map(test_interface)
X_aux = encoder.transform(list(map(lambda x: [x], reg_type)))
if len(X_test) == 0 or len(X_aux) != len(X_test):
continue
except Exception as e:
logging.info("Bad interface!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
continue
intermediate_model = Model(inputs=model.input, outputs=model.get_layer('layer1').output)
intermediate_prediction = intermediate_model.predict([X_test, X_aux], batch_size=X_test.shape[0])
_ = gc.collect()
with open(test_interface.replace(map_dir, intermediate_dir).replace('.pkl','.graph'),'w') as f_handler_graph:
for i in range(len(X_test)):
f_handler_graph.write(','.join(list(map(lambda x: str(x), res_pos[i]))) + ',' +
reg_type[i] + ',' +
','.join(list(map(lambda x: str(x), intermediate_prediction[i]))) + '\n')
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 13 21:09:56 2020
@author: yasser
"""
import logging
import os
import sys
import gc
from os import path, mkdir, getenv, listdir, remove, system, stat
import pandas as pd
import numpy as np
import glob
import seaborn as sns
from math import exp
from subprocess import CalledProcessError, check_call
import traceback
from random import shuffle, random, seed, sample
from numpy import newaxis
import matplotlib.pyplot as plt
import time
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Dot
from tensorflow.keras.backend import ones, ones_like
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.preprocessing import MinMaxScaler
import pickle
print('Your python version: {}'.format(sys.version_info.major))
USE_TENSORFLOW_AS_BACKEND = True
# If you have an NVIDIA GPU on your computer, or run on Google Colab, keep FORCE_CPU set to False; otherwise set it to True to force CPU execution.
FORCE_CPU = False
if USE_TENSORFLOW_AS_BACKEND:
os.environ['KERAS_BACKEND'] = 'tensorflow'
else:
os.environ['KERAS_BACKEND'] = 'theano'
if FORCE_CPU:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""
if USE_TENSORFLOW_AS_BACKEND == True:
import tensorflow as tf
print('Your tensorflow version: {}'.format(tf.__version__))
print("GPU : "+tf.test.gpu_device_name())
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
else:
import theano
print('Your theano version: {}'.format(theano.__version__))
logging.basicConfig(filename='manager.log', filemode='w', format='%(levelname)s: %(message)s', level=logging.DEBUG)
mainlog = logging.getLogger('main')
seed(int(np.round(np.random.random()*10)))
#################################################################################################
def save_obj(obj, name):
with open(name + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
def load_obj(name):
with open(name + '.pkl', 'rb') as f:
return pickle.load(f)
v_dim = 24
map_dir = '../Examples/map_dir'
samples_test = listdir(map_dir)
#model = load_model(path.join('../Examples', 'MODELS', 'Dockground', '0_model'))
model = load_model(path.join('../Examples', 'MODELS', 'BM5', '0_model'))
predictions_file = open('../Examples/predictions_SCR', 'w')
# Must be updated!
def load_map(train_path):
"""
check_call(
[
'lz4_win64_v1_9_3\lz4.exe', '-d', '-f',
train_path
],
stdout=sys.stdout)
"""
print(train_path)
#X_train, y_train, reg_type, _,_,_ = load_obj(train_path.replace('.pkl.lz4',''))
X_train, y_train, reg_type, res_pos,_,info = load_obj(train_path.replace('.pkl',''))
#X_train = X_train[:,:,:,:,:167]
"""
X_train_tmp = []
reg_type_tmp = []
y_train_tmp = []
for i in range(min(len(X_train),len(reg_type))):
if reg_type[i] == 'C' or reg_type[i] == 'R':
X_train_tmp.append(X_train[i])
reg_type_tmp.append(reg_type[i])
y_train_tmp.append(y_train[i])
reg_type = reg_type_tmp
X_train = np.array(X_train_tmp)
y_train = y_train_tmp
"""
#remove(train_path.replace('.lz4',''))
return X_train, y_train, reg_type, res_pos, info
fold = 'test'
batch_samples_test_1 = []
batch_samples_test_0 = []
for pair in samples_test:
batch_samples_test_1 += glob.glob(path.join(map_dir,pair,'1','*'))
batch_samples_test_0 += glob.glob(path.join(map_dir,pair,'0','*'))
batch_samples_test = batch_samples_test_1 + batch_samples_test_0
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
onehot = encoder.fit(np.asarray([['S'], ['C'], ['R']]))
predictions_file.write('Conf' + '\t' +
'Fold' + '\t' +
'Scores' + '\t' +
'Regions' + '\t' +
'Score' + '\t' +
'Time' + '\t' +
'Class' + '\t' +
'RecLig' + '\t' +
'ResNumber' + '\n')
for test_interface in batch_samples_test:
try:
print(test_interface)
X_test, y_test, reg_type, res_pos, info = load_map(test_interface)
X_aux = encoder.transform(list(map(lambda x: [x], reg_type)))
if len(X_test) == 0 or len(X_aux) != len(X_test):
continue
except Exception as e:
logging.info("Bad target complex!" + '\nError message: ' + str(e) +
"\nMore information:\n" + traceback.format_exc())
continue
start = time.time()
all_scores = model.predict([X_test, X_aux], batch_size=X_test.shape[0])
end = time.time()
_ = gc.collect()
test_preds = all_scores.mean()
print(y_test)
predictions_file.write(test_interface + '\t' +
str(fold) + '\t' +
','.join(list(map(lambda x: str(x[0]), all_scores))) + '\t' +
','.join(reg_type) + '\t' +
str(test_preds) + '\t' +
str(end-start) + '\t' +
str(y_test[0]) + '\t' +
','.join(list(map(lambda x: x[3], info))) + '\t' +
','.join(list(map(lambda x: str(x[2]), info))) + '\n')
predictions_file.close()