DTA-GNN

Target-Specific Binding Affinity Dataset Builder and GNN Trainer

Overview

DTA-GNN is a toolkit for building target-specific binding affinity datasets from the ChEMBL database and training Graph Neural Networks (GNNs) on your targets of interest. It covers the workflow from data ingestion through dataset construction to model training, with a strong focus on reproducibility and on preventing data leakage.

Key Features

Flexible Data Ingestion

Support for both ChEMBL web API and local SQLite database dumps for maximum flexibility.

Rigorous Data Cleaning

Automated unit standardization, deduplication, pChEMBL conversion, and censored value handling.

Smart Dataset Splitting

Scaffold, random, and temporal splits, with the scaffold and temporal splits intended to limit data leakage between the training and evaluation sets.

Leakage Audits

Built-in checks for train/test contamination to ensure valid model evaluation.

Multiple Interfaces

A Python API and an interactive Gradio web UI for dataset construction, plus a CLI for setup and for launching the UI.

Model Training

Train baseline models (RandomForest, SVR) and 10 GNN architectures (GIN, GCN, GAT, GraphSAGE, PNA, Transformer, TAG, ARMA, Cheb, SuperGAT) directly on your curated datasets.

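The splitting and audit features above can be pictured with a small, library-free sketch. It assumes each record already carries a scaffold string (in practice these are typically computed with a cheminformatics library such as RDKit); the function names `group_split` and `audit_overlap` are hypothetical and are not part of the DTA-GNN API. The idea is that whole scaffold groups go to one side of the split, and a separate check then confirms that no group appears on both sides.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_size=0.2, seed=0):
    """Assign whole groups to train or test so no group spans both sets."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle the group order deterministically so the split is reproducible.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Fill the test set group by group until it reaches the requested size.
    target = test_size * len(records)
    train, test = [], []
    for key in keys:
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


def audit_overlap(train, test, key):
    """Return the key values present in both splits (an empty set means clean)."""
    return {r[key] for r in train} & {r[key] for r in test}


# Toy records; the scaffold strings are hand-written placeholders, not computed.
records = [
    {"smiles": "c1ccccc1O", "scaffold": "c1ccccc1"},
    {"smiles": "c1ccccc1N", "scaffold": "c1ccccc1"},
    {"smiles": "c1ccncc1C", "scaffold": "c1ccncc1"},
    {"smiles": "c1ccncc1O", "scaffold": "c1ccncc1"},
    {"smiles": "C1CCCCC1C", "scaffold": "C1CCCCC1"},
]
train, test = group_split(records, "scaffold", test_size=0.2, seed=0)
leaked = audit_overlap(train, test, "scaffold")
print(f"train={len(train)} test={len(test)} leaked_scaffolds={len(leaked)}")
```

Because groups are indivisible, the test set can end up larger than `test_size` suggests; that is the usual trade-off of group-based splitting.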

Quick Example

Python API

from dta_gnn.pipeline import Pipeline

# Initialize pipeline with SQLite database
pipeline = Pipeline(source_type="sqlite", sqlite_path="chembl_36.db")

# Build a DTA regression dataset
df = pipeline.build_dta(
    target_ids=["CHEMBL204", "CHEMBL205"],  # Specific targets
    split_method="scaffold",                 # Scaffold-based split
    test_size=0.2,
    val_size=0.1
)

print(f"Dataset shape: {df.shape}")
print(df.head())

CLI

# Datasets are built through the Python API or the Web UI; the CLI handles setup and launches the UI:
dta_gnn setup --version 36 --dir ./chembl_dbs   # Optional: download ChEMBL DB
dta_gnn ui                                       # Launch the Web UI

Web UI

# Launch the interactive web interface
dta_gnn ui

Installation

pip install dta-gnn

GNN support and Weights & Biases (W&B) integration are included in the default install. For development (testing, linting), install from a source checkout with pip install -e ".[dev]".


Supported Task Types

Task Type	Description	Label
DTA (Regression)	Target-Specific Binding Affinity prediction	Continuous pChEMBL value
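To make the continuous label concrete: a pChEMBL value is the negative base-10 logarithm of a molar potency measurement (for example IC50, Ki, or Kd), so 10 nM corresponds to 8.0. The sketch below shows that conversion together with one plausible way to merge repeated measurements. The record fields, the helper names, and the choice to drop censored values and take the median are assumptions for illustration; they are not taken from DTA-GNN's implementation.

```python
import math
from collections import defaultdict
from statistics import median

# Conversion factors from common concentration units to molar (M).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_pchembl(value, unit):
    """Convert a concentration to a pChEMBL-style value: -log10(molar)."""
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


def aggregate(measurements):
    """Median label per (molecule, target), skipping censored measurements."""
    grouped = defaultdict(list)
    for m in measurements:
        if m["relation"] != "=":  # censored (">" or "<"): dropped in this sketch
            continue
        grouped[(m["molecule"], m["target"])].append(to_pchembl(m["value"], m["unit"]))
    return {key: median(vals) for key, vals in grouped.items()}


# Toy measurements; molecule identifiers are placeholders.
measurements = [
    {"molecule": "mol_a", "target": "CHEMBL204", "value": 10, "unit": "nM", "relation": "="},
    {"molecule": "mol_a", "target": "CHEMBL204", "value": 0.1, "unit": "uM", "relation": "="},
    {"molecule": "mol_a", "target": "CHEMBL204", "value": 1000, "unit": "nM", "relation": ">"},
    {"molecule": "mol_b", "target": "CHEMBL204", "value": 1, "unit": "uM", "relation": "="},
]
labels = aggregate(measurements)
for (molecule, target), label in sorted(labels.items()):
    print(f"{molecule} / {target}: pChEMBL = {label:.2f}")
```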

Supported Models

Model	Features	Task Type	Speed
Random Forest	Morgan FP (ECFP4)	Regression	Fast
SVR	Morgan FP (ECFP4)	Regression	Medium
GNN	2D Molecular Graphs	Regression	Slow

GNN Architectures (10 total): GIN, GCN, GAT, GraphSAGE, PNA, Transformer, TAG, ARMA, Cheb, SuperGAT
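For readers new to the "2D Molecular Graphs" input, the following library-free sketch shows the usual shape of such data: one feature vector per atom and an edge list in which each bond appears once per direction. This coordinate-style layout is the one PyTorch Geometric uses for `edge_index`; whether DTA-GNN builds its graphs exactly this way, and with which atom features, is an assumption. The helper `to_graph` is hypothetical.

```python
def to_graph(atoms, bonds):
    """Build one-hot node features and a two-row edge list from atoms and bonds."""
    # One-hot encode each atom's element over the elements present in the molecule.
    elements = sorted(set(atoms))
    node_features = [[1 if atom == element else 0 for element in elements] for atom in atoms]

    # Store each undirected bond as two directed edges (i -> j and j -> i).
    sources, targets = [], []
    for i, j in bonds:
        sources += [i, j]
        targets += [j, i]
    return node_features, [sources, targets]


# Ethanol (SMILES: CCO), heavy atoms only.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
x, edge_index = to_graph(atoms, bonds)
print("node features:", x)
print("edge index:", edge_index)
```

Real featurizers usually add more per-atom information (degree, charge, aromaticity) and per-bond features, but the node-plus-edge-list structure stays the same.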

Architecture

graph LR
    A[ChEMBL Data] --> B[Data Source]
    B --> C[Cleaning & Normalization]
    C --> D[Labeling]
    D --> E[Splitting]
    E --> F[Leakage Audit]
    F --> G[Dataset]
    G --> H[Model Training]
    H --> I[Predictions]
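
The stage order in the diagram can be read as a chain of functions, each taking the records produced by the previous one. The sketch below is a generic illustration of that pattern with two placeholder stages and a per-stage record count, which is one simple way to keep a dataset build traceable. None of the names here come from DTA-GNN; they are hypothetical.

```python
def run_stages(records, stages):
    """Apply each (name, function) stage in order and log surviving record counts."""
    log = [("input", len(records))]
    for name, stage in stages:
        records = stage(records)
        log.append((name, len(records)))
    return records, log


def drop_unlabeled(records):
    # Placeholder cleaning step: keep only records that carry an affinity value.
    return [r for r in records if r["pchembl"] is not None]


def drop_duplicates(records):
    # Placeholder deduplication: keep the first record seen for each SMILES string.
    seen, unique = set(), []
    for r in records:
        if r["smiles"] not in seen:
            seen.add(r["smiles"])
            unique.append(r)
    return unique


raw = [
    {"smiles": "CCO", "pchembl": 5.2},
    {"smiles": "CCO", "pchembl": 5.4},
    {"smiles": "CCN", "pchembl": None},
    {"smiles": "CCC", "pchembl": 6.1},
]
dataset, log = run_stages(
    raw, [("cleaning", drop_unlabeled), ("deduplication", drop_duplicates)]
)
for name, count in log:
    print(f"{name}: {count} records")
```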

Citation

If you use DTA-GNN in your research, please cite:

@article{ozsari2026dta,
  title   = {DTA-GNN: a toolkit for constructing target-specific drug--target affinity datasets and training graph neural networks},
  author  = {Özsari, Gökhan and Rifaioğlu, Ahmet Süreyya and Acar, Aybar Can and Doğan, Tunca and Atalay, M Volkan},
  journal = {SoftwareX},
  volume  = {34},
  pages   = {102671},
  year    = {2026},
  publisher = {Elsevier}
}

License

DTA-GNN is released under the MIT License. See the LICENSE file for details.