Quick Start¶
Get up and running with DTA-GNN in minutes. This guide walks you through building your first target-specific drug-target affinity (DTA) dataset and training a model on it.
Prerequisites¶
Ensure DTA-GNN is installed:
(Or install from source; see Installation.)
Option 0: One-call end-to-end (CLI)¶
If you just want a trained, evaluated GNN for a UniProt accession, the
fastest path is the train-gnn command:
# Optionally download a local ChEMBL SQLite database first (much faster):
dta_gnn setup --version 36 --dir ./chembl_dbs
# Run the full pipeline (UniProt → ChEMBL → scaffold split → W&B HPO → final GNN)
dta_gnn train-gnn P00533 \
    --architecture gin \
    --sqlite-path ./chembl_dbs/chembl_36.db \
    --n-trials 20 --epochs 30
The CLI prints a per-step timing summary and the test-set RMSE / MAE / R² when the run completes. See End-to-End Pipeline.
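For readers who want to see what the three reported numbers measure, here is a minimal sketch of RMSE, MAE, and R² in plain Python. The function name `regression_metrics` and the toy values are illustrative assumptions; DTA-GNN's own metric code is not shown in this guide and may differ in detail.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when all true labels are identical
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Toy pChEMBL labels and predictions (made-up numbers, not real results)
metrics = regression_metrics([6.0, 7.0, 8.0, 9.0], [6.5, 7.0, 7.5, 9.0])
print(metrics)
```

Lower RMSE and MAE are better; both are in pChEMBL units. R² compares the model against always predicting the mean label, so it can be negative for a poor model.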
Option 1: Web Interface (Recommended for Beginners)¶
The easiest way to get started is with the interactive web UI:
This launches a Gradio-based web application at http://127.0.0.1:7860.
Building a Dataset via UI¶
- Select Data Source: Choose "SQLite" and provide the path to your ChEMBL database, or use "Web API" (slower, but no setup required)
- Specify Targets: Enter ChEMBL target IDs (e.g., CHEMBL204, CHEMBL205) or leave empty to fetch all
- Configure Settings:
    - Task Type: Regression (DTA) for continuous binding affinity prediction
    - Split Method: Scaffold, Random, or Temporal
    - Test/Validation sizes
- Build Dataset: Click "Build Dataset" and wait for processing
- Review Results: Examine the dataset preview, visualizations, and statistics
- Download: Export your dataset as CSV or ZIP archive
Option 2: Python API (scripting)¶
For scripting and automation, use the Python API:
Build a DTA Dataset¶
from dta_gnn.pipeline import Pipeline
pipeline = Pipeline(source_type="sqlite", sqlite_path="./chembl_dbs/chembl_36.db")
df = pipeline.build_dta(
    target_ids=["CHEMBL204", "CHEMBL205"],
    split_method="scaffold",
    test_size=0.2,
    val_size=0.1,
    output_path="dataset_dta.csv",
)
Parameters explained:
| Parameter | Description |
|---|---|
| target_ids | List of ChEMBL target IDs (e.g. ["CHEMBL204", "CHEMBL205"]) |
| split_method | Split strategy: "random", "scaffold", or "temporal" |
| test_size | Fraction for test set (default: 0.2) |
| val_size | Fraction for validation set (default: 0.1) |
| output_path | Optional path to save dataset CSV |
| source_type / sqlite_path | Pipeline init: "web" or "sqlite" and path to .db |
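To make the two fraction parameters concrete, the sketch below converts `test_size` and `val_size` into approximate row counts. The helper `split_counts` is hypothetical and the rounding rule is an assumption; the library may round differently, and scaffold splits in general only approximate the requested fractions because whole scaffold groups are assigned to one split.

```python
def split_counts(n_samples, test_size=0.2, val_size=0.1):
    """Approximate number of rows per split for the given fractions."""
    if test_size < 0 or val_size < 0 or test_size + val_size >= 1:
        raise ValueError("test_size + val_size must be in [0, 1)")
    n_test = round(n_samples * test_size)
    n_val = round(n_samples * val_size)
    # Whatever is left after test and validation goes to training
    n_train = n_samples - n_test - n_val
    return {"train": n_train, "val": n_val, "test": n_test}


counts = split_counts(1000)
print(counts)
```

With the defaults, roughly 70% of the rows remain for training.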
Option 3: Python API (full workflow)¶
For maximum flexibility and integration into your workflows:
from dta_gnn.pipeline import Pipeline
# Initialize with SQLite database
pipeline = Pipeline(
    source_type="sqlite",
    sqlite_path="./chembl_dbs/chembl_36.db",
)
# Build a DTA regression dataset
df = pipeline.build_dta(
    target_ids=["CHEMBL204", "CHEMBL205", "CHEMBL206"],
    split_method="scaffold",
    test_size=0.2,
    val_size=0.1,
    output_path="dataset_dta.csv",
)
# Examine the result
print(f"Total samples: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nSplit distribution:")
print(df["split"].value_counts())
print(f"\nLabel statistics:")
print(df["label"].describe())
Output Dataset Columns¶
The generated dataset contains these key columns:
| Column | Description |
|---|---|
| molecule_chembl_id | ChEMBL molecule identifier |
| target_chembl_id | ChEMBL target identifier |
| smiles | Canonical SMILES string |
| pchembl_value | Standardized activity value |
| label | Continuous pChEMBL value for binding affinity |
| split | Dataset split: train, val, or test |
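ChEMBL defines pChEMBL as the negative base-10 logarithm of a molar activity value (for example IC50, Ki, or Kd), so higher values mean stronger binding. The sketch below shows that arithmetic for measurements given in nM; the helper name `pchembl_from_nm` is hypothetical and not part of DTA-GNN.

```python
import math


def pchembl_from_nm(value_nm):
    """Convert an activity in nanomolar to the pChEMBL scale.

    pChEMBL = -log10(value in mol/L) = 9 - log10(value in nM).
    """
    if value_nm <= 0:
        raise ValueError("activity must be a positive concentration")
    return 9.0 - math.log10(value_nm)


# A tenfold weaker binder loses exactly one pChEMBL unit
for nm in (1.0, 10.0, 1000.0):
    print(f"{nm:>7.1f} nM -> pChEMBL {pchembl_from_nm(nm):.2f}")
```

On this scale a 10 nM compound has a label of 8 and a 1 µM compound has a label of 6.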
Example Workflow¶
A complete example: build a dataset, then train both a Random Forest baseline and a GNN on it.
from dta_gnn.io.runs import create_run_dir
from dta_gnn.pipeline import Pipeline
from dta_gnn.models import (
    train_random_forest_on_run,
    train_gnn_on_run,
    GnnTrainConfig,
)
# 1. Create a timestamped run directory (also updates runs/current)
run_dir = create_run_dir()
print(f"Run directory: {run_dir}")
# 2. Build the dataset (writes dataset.csv to the run directory)
pipeline = Pipeline(source_type="sqlite", sqlite_path="./chembl_dbs/chembl_36.db")
df = pipeline.build_dta(
    target_ids=["CHEMBL204"],
    split_method="scaffold",
    test_size=0.2,
    val_size=0.1,
    output_path=str(run_dir / "dataset.csv"),
)
# 3. Save compounds.csv (required by all on-run trainers)
df[["molecule_chembl_id", "smiles"]].drop_duplicates().to_csv(
    run_dir / "compounds.csv", index=False,
)
# 4. Train a Random Forest baseline (Morgan ECFP4 fingerprints)
rf = train_random_forest_on_run(run_dir, n_estimators=500)
print("RF test RMSE:", rf.metrics["splits"]["test"]["rmse"])
# 5. Train a GNN (PyTorch Geometric)
gnn = train_gnn_on_run(run_dir, config=GnnTrainConfig(
    architecture="gin",
    hidden_dim=128,
    num_layers=3,
    epochs=20,
))
print("GNN test RMSE:", gnn.metrics["splits"]["test"]["rmse"])
Note: Pipeline.build_dta returns the DataFrame and writes dataset.csv, but it does not write compounds.csv. The on-run trainers (train_random_forest_on_run, train_svr_on_run, train_gnn_on_run) require both files in the run directory.
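Because a missing file only surfaces once training starts, a quick pre-flight check can save time. This is a minimal sketch using only the standard library; the helper `missing_run_files` is hypothetical, and the demo uses a temporary directory in place of a real run directory.

```python
import tempfile
from pathlib import Path

# The two files the on-run trainers expect, per the note above
REQUIRED_FILES = ("dataset.csv", "compounds.csv")


def missing_run_files(run_dir):
    """Return the names of required files that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_FILES if not (run_dir / name).is_file()]


# Demo with a throwaway directory standing in for a run directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "dataset.csv").write_text("molecule_chembl_id,smiles,label,split\n")
    before = missing_run_files(demo_dir)   # compounds.csv not written yet
    (demo_dir / "compounds.csv").write_text("molecule_chembl_id,smiles\n")
    after = missing_run_files(demo_dir)    # both files present

print("missing before:", before)
print("missing after:", after)
```

In a real workflow you would call `missing_run_files(run_dir)` right before the first trainer and stop if the list is non-empty.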
Dataset Splits Explained¶
DTA-GNN offers several splitting strategies:
| Strategy | Description | Use Case |
|---|---|---|
| Random | Random assignment | Baseline, general ML |
| Scaffold | Based on molecular scaffolds | Drug discovery, prevents scaffold leakage |
| Temporal | Based on publication year | Simulates prospective prediction |
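As an illustration of the temporal strategy, the sketch below sorts records by year and assigns the most recent ones to the test set. It is a generic version of the idea, not DTA-GNN's implementation; the `year` field name, the helper `temporal_split`, and the rounding of split sizes are all assumptions.

```python
def temporal_split(records, test_size=0.2, val_size=0.1, year_key="year"):
    """Label records train/val/test in chronological order (oldest first)."""
    ordered = sorted(records, key=lambda rec: rec[year_key])
    n = len(ordered)
    n_test = round(n * test_size)
    n_val = round(n * val_size)
    n_train = n - n_test - n_val
    labelled = []
    for i, rec in enumerate(ordered):
        if i < n_train:
            split = "train"
        elif i < n_train + n_val:
            split = "val"
        else:
            split = "test"
        # Copy the record so the caller's data is left untouched
        labelled.append({**rec, "split": split})
    return labelled


# Ten made-up measurements published in consecutive years
records = [{"id": f"M{i}", "year": 2005 + i} for i in range(10)]
result = temporal_split(records)
for rec in result:
    print(rec["year"], rec["split"])
```

Note that this simple version can place records from the same year on both sides of a boundary; cutting at a year threshold instead avoids that at the cost of less exact fractions.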
Data Leakage
Always use scaffold split for drug discovery applications. Random splits can lead to overly optimistic performance estimates due to scaffold leakage.
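The core idea behind a scaffold split is that every molecule sharing a scaffold goes into the same split. The following sketch shows a common greedy variant of that idea (largest scaffold groups fill the training set first); it is an assumption about the general technique, not a description of DTA-GNN's code. Scaffold strings are supplied directly here; in practice they could be computed with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`.

```python
from collections import defaultdict


def scaffold_split(scaffold_by_id, test_size=0.2, val_size=0.1):
    """Assign whole scaffold groups to train, val, or test."""
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_by_id.items():
        groups[scaffold].append(mol_id)
    # Largest groups first; ties broken by first ID for determinism
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffold_by_id)
    n_test = round(n * test_size)
    n_val = round(n * val_size)
    train_cap = n - n_test - n_val      # max rows in train
    val_cap = train_cap + n_val         # max rows in train + val

    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(val) + len(group) <= val_cap:
            val.extend(group)
        else:
            test.extend(group)
    return {"train": train, "val": val, "test": test}


# Made-up molecule IDs mapped to scaffold SMILES
scaffold_by_id = {
    "M0": "c1ccccc1", "M1": "c1ccccc1", "M2": "c1ccccc1", "M3": "c1ccccc1",
    "M4": "c1ccncc1", "M5": "c1ccncc1", "M6": "c1ccncc1",
    "M7": "C1CCCCC1",
    "M8": "c1ccc2ccccc2c1", "M9": "c1ccc2ccccc2c1",
}
splits = scaffold_split(scaffold_by_id)
for name, ids in splits.items():
    print(name, sorted({scaffold_by_id[m] for m in ids}))
```

Because no scaffold appears in more than one split, test-set performance reflects generalization to unseen chemotypes rather than memorized close analogues.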
Next Steps¶
Now that you've built your first dataset:
- Data Sources - Learn about web vs SQLite sources
- Cleaning - Understand data standardization
- Splits - Deep dive into splitting strategies
- Models - Train baseline models
- Hyperopt - Optimize hyperparameters
Common Issues¶
No activities found¶
# Check if your target IDs are valid
from dta_gnn.io.sqlite_source import ChemblSQLiteSource
source = ChemblSQLiteSource("chembl_36.db")
targets = source.fetch_targets(["CHEMBL204"])
print(targets)
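If the target exists but the build still reports no activities, it can help to count what the database holds for that target, grouped by measurement type. The sketch below runs a plain SQL query with the standard `sqlite3` module. It assumes the standard ChEMBL tables `target_dictionary`, `assays`, and `activities`, and demonstrates the query on a tiny in-memory database with made-up rows; the helper `count_activities_by_type` is hypothetical.

```python
import sqlite3


def count_activities_by_type(conn, target_chembl_id):
    """Return {standard_type: count} for one ChEMBL target ID."""
    query = """
        SELECT act.standard_type, COUNT(*) AS n
        FROM activities AS act
        JOIN assays AS a ON a.assay_id = act.assay_id
        JOIN target_dictionary AS td ON td.tid = a.tid
        WHERE td.chembl_id = ?
        GROUP BY act.standard_type
        ORDER BY n DESC, act.standard_type
    """
    return dict(conn.execute(query, (target_chembl_id,)).fetchall())


# Toy database containing only the columns the query needs
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, chembl_id TEXT);
    CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER);
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY, assay_id INTEGER, standard_type TEXT
    );
    INSERT INTO target_dictionary VALUES (1, 'CHEMBL204');
    INSERT INTO assays VALUES (10, 1);
    INSERT INTO activities VALUES (100, 10, 'IC50'), (101, 10, 'IC50'), (102, 10, 'Ki');
""")
counts = count_activities_by_type(conn, "CHEMBL204")
unknown = count_activities_by_type(conn, "CHEMBL999999")
conn.close()

print(counts)    # activity counts per standard type
print(unknown)   # empty dict when the target has no activities
```

Against a real download you would open the file instead, for example `sqlite3.connect("./chembl_dbs/chembl_36.db")`. An empty result points to a wrong or unpopulated target ID rather than to overly strict cleaning.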
Dataset is empty after cleaning¶
This usually means all activities were filtered out. Try:
- Using more targets
- Including more standard types (IC50, Ki, Kd)
- Checking that your target IDs are valid
Memory errors with large datasets¶
For datasets with millions of activities: