DTA-GNN
Target-Specific Binding Affinity Dataset Builder and GNN Trainer
Overview
DTA-GNN is a toolkit for building target-specific binding affinity datasets from the ChEMBL database and training Graph Neural Networks (GNNs) for your targets of interest. It covers the full workflow, from data ingestion through cleaning, splitting, and leakage auditing to model training, with a strong focus on reproducibility and preventing data leakage.
Key Features
- Flexible Data Ingestion: Works with either the ChEMBL web API or a local ChEMBL SQLite database dump.
- Rigorous Data Cleaning: Automated unit standardization, deduplication, pChEMBL conversion, and censored value handling.
- Smart Dataset Splitting: Scaffold, random, and temporal splits; scaffold and temporal splits help limit leakage between training and test sets.
- Leakage Audits: Built-in checks for train/test contamination to ensure valid model evaluation.
- Multiple Interfaces: CLI, Python API, and an interactive Gradio web UI for dataset construction.
- Model Training: Train baseline models (Random Forest, SVR) and 10 GNN architectures (GIN, GCN, GAT, GraphSAGE, PNA, Transformer, TAG, ARMA, Cheb, SuperGAT) directly on your curated datasets.
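As a rough illustration of the scaffold splitting idea listed above, the sketch below groups molecules by a precomputed scaffold key and assigns whole groups to train, validation, or test, so that no scaffold spans two subsets. This is a common greedy scheme and not necessarily what DTA-GNN implements; the function name `scaffold_split` and the toy scaffold labels are assumptions for illustration, and in practice the keys would come from a cheminformatics toolkit.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_size=0.2, val_size=0.1):
    """Greedy group split: every scaffold lands in exactly one subset.

    `scaffolds` holds one scaffold key per molecule (hypothetical input;
    computing the keys is out of scope for this sketch).
    Returns three lists of molecule indices: train, val, test.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest scaffold groups first, so big chemical series end up in train.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    n_test = round(test_size * n)
    n_val = round(val_size * n)
    n_train = n - n_val - n_test

    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(val) + len(group) <= n_val:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


# Ten toy molecules spread over five hypothetical scaffolds.
scaffolds = ["A"] * 5 + ["B"] * 2 + ["C", "D", "E"]
train, val, test = scaffold_split(scaffolds)
print(train, val, test)  # [0, 1, 2, 3, 4, 5, 6] [7] [8, 9]
```

Because groups are never divided, the realized subset sizes only approximate the requested fractions when scaffold groups are large.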
Quick Example
from dta_gnn.pipeline import Pipeline
# Initialize pipeline with SQLite database
pipeline = Pipeline(source_type="sqlite", sqlite_path="chembl_36.db")
# Build a DTA regression dataset
df = pipeline.build_dta(
    target_ids=["CHEMBL204", "CHEMBL205"],  # Specific targets
    split_method="scaffold",                # Scaffold-based split
    test_size=0.2,
    val_size=0.1,
)
print(f"Dataset shape: {df.shape}")
print(df.head())
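Once a dataset like the one above has been built, a train/test contamination check can be sketched in a few lines. The snippet is a minimal stand-alone illustration, not DTA-GNN's built-in audit: the function `audit_leakage` and the column names `smiles` and `split` are assumptions, so adjust them to whatever columns the returned DataFrame actually contains.

```python
def audit_leakage(rows, key="smiles", split_col="split"):
    """Return keys that occur in more than one split (empty dict = clean)."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], set()).add(row[split_col])
    return {k: sorted(s) for k, s in seen.items() if len(s) > 1}


# Toy records; with pandas, df.to_dict("records") yields this shape.
rows = [
    {"smiles": "CCO", "split": "train"},
    {"smiles": "c1ccccc1", "split": "train"},
    {"smiles": "CCO", "split": "test"},  # same molecule in train and test
    {"smiles": "CC(=O)O", "split": "val"},
]
leaks = audit_leakage(rows)
print(leaks)  # {'CCO': ['test', 'train']}
```

The same function can audit scaffold-level leakage by passing a scaffold column as `key`.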
Installation
GNN support and Weights & Biases (W&B) integration are included in the default install. For development (testing, linting), install in editable mode from the repository root with `pip install -e ".[dev]"`.
Supported Task Types
| Task Type | Description | Label |
|---|---|---|
| DTA (Regression) | Target-Specific Binding Affinity prediction | Continuous pChEMBL value |
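The pChEMBL label in the table is ChEMBL's negative base-10 logarithm of a molar activity value (for example IC50, Ki, or Kd), so 10 nM corresponds to a pChEMBL of 8. The sketch below shows that conversion; the helper `to_pchembl` and its unit table are illustrative assumptions, not DTA-GNN's own conversion code.

```python
import math

# Exponent e such that 1 unit equals 10**(-e) mol/L.
_UNIT_EXPONENT = {"M": 0, "mM": 3, "uM": 6, "nM": 9, "pM": 12}


def to_pchembl(value, unit="nM"):
    """Convert a positive activity value to -log10(value in mol/L)."""
    if value <= 0:
        raise ValueError("activity value must be positive")
    return _UNIT_EXPONENT[unit] - math.log10(value)


print(to_pchembl(10, "nM"))  # 8.0
print(to_pchembl(1, "uM"))   # 6.0
```

Higher pChEMBL values mean stronger binding, which is why the label works directly as a regression target.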
Supported Models
| Model | Features | Task Type | Speed |
|---|---|---|---|
| Random Forest | Morgan FP (ECFP4) | Regression | Fast |
| SVR | Morgan FP (ECFP4) | Regression | Medium |
| GNN | 2D Molecular Graphs | Regression | Slow |
GNN architectures (10 total): GIN, GCN, GAT, GraphSAGE, PNA, Transformer, TAG, ARMA, Cheb, SuperGAT.
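To make the "2D Molecular Graphs" input concrete, the sketch below runs one GIN-style aggregation step, (1 + eps) * h_v plus the sum of neighbour features, over a tiny molecule in plain Python. It deliberately omits the learnable MLP that GIN applies afterwards, and the function `sum_aggregate` and the two-element atom features are assumptions for illustration; DTA-GNN's actual featurization and layers may differ.

```python
def sum_aggregate(features, edges, eps=0.0):
    """One GIN-style aggregation: (1 + eps) * h_v + sum of neighbour h_u."""
    neighbours = {v: [] for v in range(len(features))}
    for u, v in edges:  # bonds are undirected
        neighbours[u].append(v)
        neighbours[v].append(u)

    updated = []
    for v, h_v in enumerate(features):
        out = [(1.0 + eps) * x for x in h_v]
        for u in neighbours[v]:
            out = [a + b for a, b in zip(out, features[u])]
        updated.append(out)
    return updated


# Ethanol (SMILES "CCO"): atoms C, C, O with bonds C-C and C-O.
# Each atom carries a toy one-hot feature [is_carbon, is_oxygen].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
print(sum_aggregate(features, edges))
# [[2.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
```

After one step each atom's vector already encodes its immediate neighbourhood; stacking several such steps lets information travel further across the molecule.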
Architecture
graph LR
    A[ChEMBL Data] --> B[Data Source]
    B --> C[Cleaning & Normalization]
    C --> D[Labeling]
    D --> E[Splitting]
    E --> F[Leakage Audit]
    F --> G[Dataset]
    G --> H[Model Training]
    H --> I[Predictions]
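The staged flow in the diagram can be pictured as a chain of functions, each taking and returning a list of records. The sketch below wires up two toy stages in that style. Everything in it is hypothetical (the stage names, the record fields, dropping censored measurements, and aggregating duplicates by median); it illustrates the composition pattern, not DTA-GNN's internals.

```python
from functools import reduce
from statistics import median


def clean(records):
    """Keep exact measurements ('=') that actually have a value."""
    return [r for r in records if r["relation"] == "=" and r["pchembl"] is not None]


def deduplicate(records):
    """Collapse repeated (molecule, target) pairs to their median value."""
    grouped = {}
    for r in records:
        grouped.setdefault((r["molecule"], r["target"]), []).append(r["pchembl"])
    return [
        {"molecule": m, "target": t, "pchembl": median(vals)}
        for (m, t), vals in grouped.items()
    ]


def run_pipeline(records, stages):
    """Apply each stage in order, feeding its output to the next one."""
    return reduce(lambda data, stage: stage(data), stages, records)


raw = [
    {"molecule": "M1", "target": "CHEMBL204", "relation": "=", "pchembl": 7.0},
    {"molecule": "M1", "target": "CHEMBL204", "relation": "=", "pchembl": 8.0},
    {"molecule": "M2", "target": "CHEMBL204", "relation": ">", "pchembl": 5.0},
    {"molecule": "M3", "target": "CHEMBL205", "relation": "=", "pchembl": None},
]
dataset = run_pipeline(raw, [clean, deduplicate])
print(dataset)
# [{'molecule': 'M1', 'target': 'CHEMBL204', 'pchembl': 7.5}]
```

Keeping every stage a plain function of records makes it straightforward to add, reorder, or test steps such as splitting and auditing independently.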
Citation
If you use DTA-GNN in your research, please cite:
@article{ozsari2026dta,
  title = {DTA-GNN: a toolkit for constructing target-specific drug--target affinity datasets and training graph neural networks},
  author = {Özsari, Gökhan and Rifaioğlu, Ahmet Süreyya and Acar, Aybar Can and Doğan, Tunca and Atalay, M Volkan},
  journal = {SoftwareX},
  volume = {34},
  pages = {102671},
  year = {2026},
  publisher = {Elsevier}
}
License
DTA-GNN is released under the MIT License. See the LICENSE file for details.