DTA-GNN

Target-Specific Binding Affinity Dataset Builder and GNN Trainer


Overview

DTA-GNN is a toolkit for building target-specific binding affinity datasets from the ChEMBL database and training Graph Neural Networks (GNNs) on the targets you care about. It covers the full workflow, from data ingestion through dataset construction to model training, with a strong focus on reproducibility and preventing data leakage.

Key Features

  • Flexible Data Ingestion


    Support for both ChEMBL web API and local SQLite database dumps for maximum flexibility.

    Data Sources

  • Rigorous Data Cleaning


    Automated unit standardization, deduplication, pChEMBL conversion, and censored value handling.

    Cleaning

  • Smart Dataset Splitting


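    Random, scaffold, and temporal splits. Scaffold and temporal splits reduce train/test leakage by keeping structurally related or later-published compounds out of the training set.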

    Splits

  • Leakage Audits


    Built-in checks for train/test contamination to ensure valid model evaluation.

    Audits

  • Multiple Interfaces


    CLI, Python API, and an interactive Gradio web UI for dataset construction.

    Interfaces

  • Model Training


    Train baseline models (RandomForest, SVR) and 10 GNN architectures (GIN, GCN, GAT, GraphSAGE, PNA, Transformer, TAG, ARMA, Cheb, SuperGAT) directly on your curated datasets.

    Models
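
To make the cleaning steps listed above concrete, here is a minimal, standard-library sketch of unit standardization, censored-value filtering, pChEMBL conversion, and deduplication. It is not DTA-GNN's implementation: the record fields (`molecule_id`, `target_id`, `value`, `unit`, `relation`), the `TO_NM` table, and the choices to drop censored measurements and aggregate duplicates by median are all assumptions made for illustration. The only fixed fact used is the definition of pChEMBL as the negative base-10 logarithm of the molar concentration.

```python
import math
from collections import defaultdict
from statistics import median

# Conversion factors to nanomolar (illustrative subset of units; an assumption).
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_pchembl(value, unit):
    """Return -log10 of the molar concentration, i.e. 9 - log10(value in nM)."""
    nanomolar = value * TO_NM[unit]
    return 9.0 - math.log10(nanomolar)


def clean(records):
    """Keep exact measurements, convert to pChEMBL, and take the median per pair."""
    grouped = defaultdict(list)
    for rec in records:
        if rec["relation"] != "=":
            # Censored values (">", "<") are simply dropped in this sketch.
            continue
        if rec["unit"] not in TO_NM or rec["value"] <= 0:
            # Unknown units and non-positive values cannot be converted.
            continue
        pair = (rec["molecule_id"], rec["target_id"])
        grouped[pair].append(to_pchembl(rec["value"], rec["unit"]))
    # Deduplicate: one label per (molecule, target) pair.
    return {pair: median(values) for pair, values in grouped.items()}


# Hypothetical records: two replicate measurements and one censored value.
records = [
    {"molecule_id": "M1", "target_id": "CHEMBL204", "value": 10.0, "unit": "nM", "relation": "="},
    {"molecule_id": "M1", "target_id": "CHEMBL204", "value": 0.1, "unit": "uM", "relation": "="},
    {"molecule_id": "M2", "target_id": "CHEMBL204", "value": 1000.0, "unit": "nM", "relation": ">"},
]
cleaned = clean(records)
print(cleaned)
```

The two M1 measurements (10 nM and 0.1 uM) convert to pChEMBL 8 and 7 and collapse to a single median label, while the censored M2 record is excluded. A real pipeline may prefer to keep censored values with a flag rather than discard them.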

Quick Example

from dta_gnn.pipeline import Pipeline

# Initialize pipeline with SQLite database
pipeline = Pipeline(source_type="sqlite", sqlite_path="chembl_36.db")

# Build a DTA regression dataset
df = pipeline.build_dta(
    target_ids=["CHEMBL204", "CHEMBL205"],  # Specific targets
    split_method="scaffold",                 # Scaffold-based split
    test_size=0.2,
    val_size=0.1
)

print(f"Dataset shape: {df.shape}")
print(df.head())

Dataset building is done through the Python API or the Web UI. The CLI covers setup and launching the UI:

dta_gnn setup --version 36 --dir ./chembl_dbs   # Optional: download ChEMBL DB
dta_gnn ui                                       # Launch the interactive Web UI

Installation

pip install dta-gnn

GNN support and Weights & Biases (W&B) are included in the default install. For development (testing, linting), use pip install -e ".[dev]".

Complete Installation Guide

Supported Task Types

| Task Type | Description | Label |
|---|---|---|
| DTA (Regression) | Target-specific binding affinity prediction | Continuous pChEMBL value |

Supported Models

| Model | Features | Task Type | Speed |
|---|---|---|---|
| Random Forest | Morgan FP (ECFP4) | Regression | Fast |
| SVR | Morgan FP (ECFP4) | Regression | Medium |
| GNN | 2D Molecular Graphs | Regression | Slow |

GNN Architectures (10 total): GIN, GCN, GAT, GraphSAGE, PNA, Transformer, TAG, ARMA, Cheb, SuperGAT

Architecture

graph LR
    A[ChEMBL Data] --> B[Data Source]
    B --> C[Cleaning & Normalization]
    C --> D[Labeling]
    D --> E[Splitting]
    E --> F[Leakage Audit]
    F --> G[Dataset]
    G --> H[Model Training]
    H --> I[Predictions]
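
The Splitting and Leakage Audit stages in the diagram above can be illustrated with a small, standard-library sketch. It is an assumption-laden illustration, not DTA-GNN's code: the function names `group_split` and `audit_overlap` are hypothetical, and the scaffold labels are placeholder strings (in practice they would typically be Bemis-Murcko scaffolds computed with a cheminformatics toolkit). The idea shown is that whole scaffold groups are assigned to one side of the split, and the audit then checks that no scaffold appears on both sides.

```python
from collections import defaultdict


def group_split(items, group_of, test_size=0.2):
    """Assign whole groups to train or test so that no group is shared."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of[item]].append(item)

    n_test = round(test_size * len(items))
    train_cutoff = len(items) - n_test

    train, test = [], []
    # Largest groups first; ties broken by key so the result is deterministic.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        members = groups[key]
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


def audit_overlap(train, test, key_of):
    """Return the keys (e.g. scaffolds) that occur in both train and test."""
    return {key_of[x] for x in train} & {key_of[x] for x in test}


# Hypothetical molecules with placeholder scaffold labels A-D.
scaffold_of = {
    "m1": "A", "m2": "A", "m3": "A", "m4": "A",
    "m5": "B", "m6": "B", "m7": "B",
    "m8": "C", "m9": "C",
    "m10": "D",
}
molecules = list(scaffold_of)

train, test = group_split(molecules, scaffold_of, test_size=0.2)
print("scaffold split leakage:", audit_overlap(train, test, scaffold_of))

# A naive positional split cuts scaffold C in half, which the audit reports.
naive_train, naive_test = molecules[:8], molecules[8:]
print("naive split leakage:", audit_overlap(naive_train, naive_test, scaffold_of))
```

With grouped assignment, the achieved test fraction only approximates `test_size`, because groups are never divided. That trade-off is what keeps the audit result empty.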

Citation

If you use DTA-GNN in your research, please cite:

@article{ozsari2026dta,
  title   = {DTA-GNN: a toolkit for constructing target-specific drug--target affinity datasets and training graph neural networks},
  author  = {Özsari, Gökhan and Rifaioğlu, Ahmet Süreyya and Acar, Aybar Can and Doğan, Tunca and Atalay, M Volkan},
  journal = {SoftwareX},
  volume  = {34},
  pages   = {102671},
  year    = {2026},
  publisher = {Elsevier}
}

License

DTA-GNN is released under the MIT License. See the LICENSE file for details.