Hyperparameter Optimization¶
DTA-GNN includes integrated hyperparameter optimisation (HPO) for all of its
models. The backend is Weights & Biases Bayesian sweeps. To run without
sending data to the W&B cloud, set WANDB_MODE=offline.
Overview¶
HPO automatically searches for optimal hyperparameters:
| Model | Tunable parameters |
|---|---|
| Random Forest | n_estimators, max_depth, min_samples_split |
| SVR (regression only) | C, epsilon, kernel |
| GNN | learning rate, embedding_dim, hidden_dim, num_layers, dropout, pooling, residual, head_mlp_layers, epochs, batch_size, plus architecture-specific knobs |
The validation strategy is automatic:
- If the run's dataset.csv contains a val split, holdout validation is used (train on train, score on val).
- Otherwise the optimiser falls back to 5-fold KFold cross-validation on the non-test rows, using R² as the score (regression).
The test split is always reserved and is never used during HPO.
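The selection rule above can be sketched in a few lines. This is a minimal illustration rather than the package's implementation: the `split` column name and the `choose_validation_strategy` helper are assumptions made for the example, while the two strategy labels follow the `strategy` values listed for `HyperoptResult`.

```python
import csv
import io


def choose_validation_strategy(csv_text, split_column="split", cv_folds=5):
    """Return (strategy, folds) for a dataset given as CSV text.

    Assumes a hypothetical `split` column holding train/val/test labels.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    splits = {row[split_column] for row in rows}
    if "val" in splits:
        # Holdout: fit on the train rows, score on the val rows.
        return "holdout-val", None
    # No val rows: cross-validate on everything except the test rows.
    return "cv", cv_folds


with_val = "smiles,label,split\nCCO,5.1,train\nCCN,6.0,val\nCCC,4.2,test\n"
without_val = "smiles,label,split\nCCO,5.1,train\nCCN,6.0,train\nCCC,4.2,test\n"
print(choose_validation_strategy(with_val))
print(choose_validation_strategy(without_val))
```

In either branch the test rows are left out of the search entirely.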
Installation¶
W&B is included in the default install; no extra step is required. Optional Optuna fallback also ships with the package.
Quick Start¶
Via Web UI¶
- Build a dataset in the Dataset Builder tab.
- Open the HPO tab.
- Pick the model type (RandomForest, SVR, or GNN) and the parameters to optimise.
- Set the number of trials and your W&B project name (and API key if not logged in).
- Click Run Hyperparameter Optimization, then download hyperopt_best_params_…json.
Via Python API¶
from dta_gnn.models.hyperopt import HyperoptConfig, optimize_random_forest_wandb

config = HyperoptConfig(
    model_type="RandomForest",
    n_trials=20,
    rf_optimize_n_estimators=True, rf_n_estimators_min=50, rf_n_estimators_max=500,
    rf_optimize_max_depth=True, rf_max_depth_min=5, rf_max_depth_max=50,
)
result = optimize_random_forest_wandb(
    "runs/current",
    config=config,
    project="my-chembl-project",
)
print("Best params:", result.best_params)
print("Best score :", result.best_value)
HyperoptConfig Reference¶
HyperoptConfig (in dta_gnn.models.hyperopt) is a single dataclass
covering all three model types. Only the fields relevant to your model_type
are read; unused fields are ignored.
Each tunable parameter has an *_optimize_* boolean flag (for example rf_optimize_n_estimators, svr_optimize_C, or optimize_lr). The sweep only varies parameters whose flag is True. If no flag is True, HPO raises ValueError("No parameters selected for optimization.").
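To make the flag behaviour concrete, here is a minimal sketch of how enabled flags could be turned into the `parameters` section of a W&B sweep configuration. `ToyRFConfig` and `build_sweep_parameters` are hypothetical stand-ins written for this example, and the metric name `val_r2` is an assumption; only the field names and the error message come from this page.

```python
from dataclasses import dataclass


@dataclass
class ToyRFConfig:
    """Hypothetical stand-in for the RandomForest part of HyperoptConfig."""
    rf_optimize_n_estimators: bool = False
    rf_n_estimators_min: int = 50
    rf_n_estimators_max: int = 500
    rf_optimize_max_depth: bool = False
    rf_max_depth_min: int = 5
    rf_max_depth_max: int = 50


def build_sweep_parameters(cfg):
    """Collect a W&B-style `parameters` dict from the enabled flags only."""
    params = {}
    if cfg.rf_optimize_n_estimators:
        params["n_estimators"] = {
            "min": cfg.rf_n_estimators_min,
            "max": cfg.rf_n_estimators_max,
        }
    if cfg.rf_optimize_max_depth:
        params["max_depth"] = {
            "min": cfg.rf_max_depth_min,
            "max": cfg.rf_max_depth_max,
        }
    if not params:
        # Same message this page documents for the real optimiser.
        raise ValueError("No parameters selected for optimization.")
    return params


sweep = {
    "method": "bayes",
    "metric": {"name": "val_r2", "goal": "maximize"},  # metric name is assumed
    "parameters": build_sweep_parameters(ToyRFConfig(rf_optimize_n_estimators=True)),
}
print(sweep["parameters"])
```

Parameters whose flag stays False never enter the sweep, so `max_depth` is absent from the printed dictionary.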
from dta_gnn.models.hyperopt import HyperoptConfig

config = HyperoptConfig(
    # General
    model_type="GNN",  # "RandomForest" | "SVR" | "GNN"
    n_trials=20,
    n_jobs=1,  # parallel jobs (RF only)
    sampler_seed=42,
    # ---------- RandomForest knobs ----------
    rf_optimize_n_estimators=True, rf_n_estimators_min=50, rf_n_estimators_max=500,
    rf_optimize_max_depth=True, rf_max_depth_min=5, rf_max_depth_max=50,
    rf_optimize_min_samples_split=True,
    rf_min_samples_split_min=2, rf_min_samples_split_max=20,
    # ---------- SVR knobs (regression only) ----------
    svr_optimize_C=True, svr_C_min=0.1, svr_C_max=100.0, svr_C_default=10.0,
    svr_optimize_epsilon=True, svr_epsilon_min=0.01, svr_epsilon_max=0.2, svr_epsilon_default=0.1,
    svr_optimize_kernel=True, svr_kernel_choices=["rbf", "linear"], svr_kernel_default="rbf",
    # ---------- GNN knobs ----------
    architecture="gin",  # gin | gcn | gat | sage | pna | transformer | tag | arma | cheb | supergat
    optimize_lr=True, lr_min=1e-5, lr_max=1e-2,
    optimize_epochs=True, epochs_min=5, epochs_max=50, epochs_default=20,
    optimize_batch_size=True, batch_size_min=16, batch_size_max=256, batch_size_default=64,
    optimize_embedding_dim=True, embedding_dim_min=32, embedding_dim_max=512, embedding_dim_default=128,
    optimize_hidden_dim=True, hidden_dim_min=32, hidden_dim_max=512, hidden_dim_default=128,
    optimize_num_layers=True, num_layers_min=1, num_layers_max=5, num_layers_default=3,
    optimize_dropout=True, dropout_min=0.0, dropout_max=0.6, dropout_default=0.1,
    optimize_pooling=True, pooling_choices=["add", "mean", "max", "attention"], pooling_default="add",
    optimize_residual=True, residual_default=False,
    optimize_head_mlp_layers=True, head_mlp_layers_min=1, head_mlp_layers_max=4, head_mlp_layers_default=2,
    # GIN-specific
    optimize_gin_conv_mlp_layers=True,
    gin_conv_mlp_layers_min=1, gin_conv_mlp_layers_max=4, gin_conv_mlp_layers_default=2,
    optimize_gin_train_eps=True, gin_train_eps_default=False,
    optimize_gin_eps=True, gin_eps_min=0.0, gin_eps_max=1.0, gin_eps_default=0.0,
    # GAT-specific
    optimize_gat_heads=True, gat_heads_min=1, gat_heads_max=8, gat_heads_default=4,
    # GraphSAGE-specific
    optimize_sage_aggr=True, sage_aggr_choices=["mean", "max", "lstm", "pool"], sage_aggr_default="mean",
    # Transformer-specific
    optimize_transformer_heads=True,
    transformer_heads_min=1, transformer_heads_max=8, transformer_heads_default=4,
    # TAG-specific
    optimize_tag_k=True, tag_k_min=1, tag_k_max=5, tag_k_default=2,
    # ARMA-specific
    optimize_arma_stacks=True,
    arma_num_stacks_min=1, arma_num_stacks_max=3, arma_num_stacks_default=1,
    optimize_arma_layers=True,
    arma_num_layers_min=1, arma_num_layers_max=3, arma_num_layers_default=1,
    # Cheb-specific
    optimize_cheb_k=True, cheb_k_min=1, cheb_k_max=5, cheb_k_default=2,
    # SuperGAT-specific
    optimize_supergat_heads=True,
    supergat_heads_min=1, supergat_heads_max=8, supergat_heads_default=4,
    optimize_supergat_attention_type=True,
    supergat_attention_type_choices=["MX", "SD"],
    supergat_attention_type_default="MX",
)
Model-Specific Sweeps¶
Random Forest¶
from dta_gnn.models.hyperopt import optimize_random_forest_wandb, HyperoptConfig

result = optimize_random_forest_wandb(
    run_dir="runs/current",
    config=HyperoptConfig(
        model_type="RandomForest", n_trials=20,
        rf_optimize_n_estimators=True,
        rf_optimize_max_depth=True,
    ),
    project="chembl-rf-sweep",
    entity=None,  # optional W&B team
    api_key=None,  # optional; uses logged-in session by default
    sweep_name="rf_sweep_1",
    radius=2,  # Morgan FP radius
    n_bits=2048,  # Morgan FP bits
)
| Parameter | Default range | Description |
|---|---|---|
| n_estimators | 50–500 | Number of trees |
| max_depth | 5–50 | Maximum tree depth |
| min_samples_split | 2–20 | Minimum samples per split |
SVR (regression only)¶
from dta_gnn.models.hyperopt import optimize_svr_wandb, HyperoptConfig

config = HyperoptConfig(
    model_type="SVR",
    n_trials=20,
    svr_optimize_C=True, svr_C_min=0.1, svr_C_max=100.0,
    svr_optimize_epsilon=True, svr_epsilon_min=0.01, svr_epsilon_max=0.2,
    svr_optimize_kernel=True, svr_kernel_choices=["rbf", "linear"],
)
result = optimize_svr_wandb("runs/current", config=config, project="chembl-svr-sweep")
print("Best:", result.best_params, "score:", result.best_value)
| Parameter | Default range | Description |
|---|---|---|
| C | 0.1–100 (log) | Regularisation |
| epsilon | 0.01–0.2 (log) | Epsilon-tube width |
| kernel | rbf, linear | Kernel function |
SVR sweeps require a regression dataset; classification runs raise ValueError.
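The "(log)" annotation on C and epsilon indicates that these ranges are searched on a logarithmic scale. The sketch below shows what that means for C, using plain random sampling as a stand-in for the Bayesian sweep; `sample_log_uniform` is a hypothetical helper, not part of the package.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
draws = [sample_log_uniform(0.1, 100.0, rng) for _ in range(10_000)]

# 0.1..1 is one of the three decades in 0.1..100, so about a third of the
# draws should fall below 1.0 (a linear scale would give under 1%).
below_one = sum(d < 1.0 for d in draws) / len(draws)
print(f"share of C draws below 1.0: {below_one:.2f}")
```

Sampling on a log scale gives small and large values of C comparable attention, which suits parameters that span several orders of magnitude.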
GNN¶
from dta_gnn.models.hyperopt import optimize_gnn_wandb, HyperoptConfig

config = HyperoptConfig(
    model_type="GNN",
    n_trials=20,
    architecture="gin",
    optimize_lr=True,
    optimize_dropout=True,
    optimize_pooling=True,
)
result = optimize_gnn_wandb("runs/current", config=config, project="chembl-gnn-sweep")
print("Best:", result.best_params, "score:", result.best_value)
Architecture-specific tunables:
| Architecture | Parameter | Range |
|---|---|---|
| GIN | gin_conv_mlp_layers / gin_train_eps / gin_eps | 1–4 / bool / 0.0–1.0 |
| GAT | gat_heads | 1–8 |
| GraphSAGE | sage_aggr | mean/max/lstm/pool |
| Transformer | transformer_heads | 1–8 |
| TAG | tag_k | 1–5 |
| ARMA | arma_num_stacks / arma_num_layers | 1–3 / 1–3 |
| Cheb | cheb_k | 1–5 |
| SuperGAT | supergat_heads / supergat_attention_type | 1–8 / MX or SD |
Validation Strategy¶
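If dataset.csv contains a val split, each trial trains on the train rows and is scored on the val rows (holdout validation).
Otherwise, each trial is scored with 5-fold cross-validation on the non-test rows.
In both cases the score is R² (for regression). The CV is plain KFold, not stratified.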
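For readers who want to see the cross-validation fallback spelled out, the following pure-Python sketch builds plain (unstratified) folds and scores them with R² = 1 − SS_res / SS_tot. It assumes contiguous, unshuffled folds, which may differ from what the package does internally, and the "model" is a toy that predicts the training-fold mean.

```python
def kfold_indices(n_rows, n_folds=5):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy "model": predict the training-fold mean for every validation row.
y = [5.0, 6.0, 7.0, 8.0, 9.0, 5.5, 6.5, 7.5, 8.5, 9.5]
scores = []
for train_idx, val_idx in kfold_indices(len(y), n_folds=5):
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    preds = [train_mean] * len(val_idx)
    scores.append(r2_score([y[i] for i in val_idx], preds))

print("mean CV R2:", sum(scores) / len(scores))
```

Each row is used for validation exactly once, and the per-fold scores are averaged into a single value per trial.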
HyperoptResult¶
from dataclasses import dataclass
from pathlib import Path

@dataclass
class HyperoptResult:
    run_dir: Path
    best_params: dict  # winning hyperparameters
    best_value: float  # best metric (validation R² for GNN/SVR/RF)
    best_trial_number: int  # 0-indexed winning trial
    n_trials: int
    study_path: str  # W&B sweep ID (or Optuna study path)
    best_params_path: str  # JSON file with best_params
    strategy: str  # "holdout-val" or "cv"
    cv_folds_used: int | None  # only set when strategy == "cv"
A hyperopt_best_params_*.json file is written into the run directory:
{
  "n_estimators": 342,
  "max_depth": 28,
  "min_samples_split": 5,
  "radius": 2,
  "n_bits": 2048,
  "task_type": "regression"
}
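When the `HyperoptResult` object is no longer in memory, the saved file can be read back from the run directory. The sketch below is one way to do that; `load_best_params` is a hypothetical helper, and picking the most recently modified match is an assumption about what "latest" should mean. If the result object is still available, `best_params_path` already points at the file.

```python
import json
import tempfile
from pathlib import Path


def load_best_params(run_dir):
    """Return the contents of the newest hyperopt_best_params_*.json in run_dir."""
    candidates = sorted(
        Path(run_dir).glob("hyperopt_best_params_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no hyperopt_best_params_*.json in {run_dir}")
    return json.loads(candidates[-1].read_text())


# Demo against a temporary directory; the "demo" file suffix is made up.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hyperopt_best_params_demo.json"
    path.write_text(json.dumps({"n_estimators": 342, "max_depth": 28}))
    params = load_best_params(tmp)

print(params["n_estimators"])
```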
Weights & Biases Integration¶
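import wandb
from dta_gnn.models.hyperopt import optimize_random_forest_wandb

# Option 1: log in interactively
wandb.login()

# Option 2: environment variable (set in the shell)
# export WANDB_API_KEY=<your-key>

# Option 3: pass the key to the sweep function (other arguments elided)
optimize_random_forest_wandb(..., api_key="<your-key>")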
Open the corresponding W&B project to inspect parameter importance, parallel-coordinates plots, and per-trial training curves.
To run without sending data to the W&B cloud, set WANDB_MODE=offline as described under "Running offline (no W&B cloud)".
Aliases¶
For backward compatibility, the package exposes the following aliases:
| Alias | Resolves to |
|---|---|
| optimize_random_forest | optimize_random_forest_wandb |
| optimize_gnn | optimize_gnn_wandb |
| optimize_gin | optimize_gnn_wandb |
They behave identically to the canonical names; there is no separate Optuna backend at this time.
Running offline (no W&B cloud)¶
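Set the environment variable in the shell before launching (export WANDB_MODE=offline), or set it in Python before running the sweep: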
import os
os.environ["WANDB_MODE"] = "offline"

from dta_gnn.models.hyperopt import optimize_gnn_wandb, HyperoptConfig

result = optimize_gnn_wandb(
    "runs/current",
    config=HyperoptConfig(model_type="GNN", n_trials=10, optimize_lr=True),
    project="local-only",
)
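If offline mode should apply to a single sweep rather than the whole process, the variable can be set temporarily. The context manager below is a small sketch using only the standard library; `wandb_offline` is a hypothetical helper, and it assumes W&B reads `WANDB_MODE` when the sweep starts rather than at import time.

```python
import os
from contextlib import contextmanager


@contextmanager
def wandb_offline():
    """Temporarily set WANDB_MODE=offline, restoring the previous value on exit."""
    previous = os.environ.get("WANDB_MODE")
    os.environ["WANDB_MODE"] = "offline"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("WANDB_MODE", None)
        else:
            os.environ["WANDB_MODE"] = previous


with wandb_offline():
    # Call the sweep function here, e.g. optimize_gnn_wandb(...).
    mode_inside = os.environ["WANDB_MODE"]

print(mode_inside)
```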
Best Practices¶
| Dataset size | Recommended trials |
|---|---|
| < 1 000 | 10–20 |
| 1 000–10 000 | 20–50 |
| > 10 000 | 50–100 |
- Start broad (enable many *_optimize_* flags), then narrow.
- Widen ranges if the best value lands at a boundary.
- Smaller epochs_max and a larger batch_size_min make GNN sweeps cheaper.
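The boundary check can be automated after a sweep. The helper below is a hypothetical sketch, not a package function: it flags numeric parameters whose best value lies within a small fraction (`rel_tol`, an arbitrary 2% here) of either end of the searched range.

```python
def params_at_boundary(best_params, ranges, rel_tol=0.02):
    """Return names of numeric parameters whose best value sits at a range edge."""
    flagged = []
    for name, (low, high) in ranges.items():
        value = best_params.get(name)
        if value is None:
            continue  # parameter was not part of this sweep
        margin = rel_tol * (high - low)
        if value <= low + margin or value >= high - margin:
            flagged.append(name)
    return flagged


best = {"n_estimators": 498, "max_depth": 28}
ranges = {"n_estimators": (50, 500), "max_depth": (5, 50)}
print(params_at_boundary(best, ranges))  # ['n_estimators']
```

Here `n_estimators` ends up next to its upper limit of 500, which suggests raising `rf_n_estimators_max` for the next sweep.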
Troubleshooting¶
ValueError: No parameters selected for optimization. At least one *_optimize_* flag must be True.
All trials report the same score. Widen the search ranges, use a larger dataset, or sanity-check the labels.
W&B connection issues. Run wandb.login(verify=True) to test your credentials, or run with WANDB_MODE=offline.
CUDA OOM during GNN sweeps. Reduce batch_size_max, hidden_dim_max, and/or embedding_dim_max.
Complete Example¶
from dta_gnn.io.runs import create_run_dir
from dta_gnn.pipeline import Pipeline
from dta_gnn.models.hyperopt import HyperoptConfig, optimize_gnn_wandb
from dta_gnn.models import GnnTrainConfig, train_gnn_on_run

# 1. Build a dataset that includes a val split
run_dir = create_run_dir()
pipeline = Pipeline(source_type="sqlite", sqlite_path="./chembl_dbs/chembl_36.db")
df = pipeline.build_dta(
    target_ids=["CHEMBL204"],
    split_method="scaffold",
    val_size=0.1,
    output_path=str(run_dir / "dataset.csv"),
)
df[["molecule_chembl_id", "smiles"]].drop_duplicates().to_csv(
    run_dir / "compounds.csv", index=False,
)

# 2. Sweep
hpo = optimize_gnn_wandb(
    run_dir,
    config=HyperoptConfig(
        model_type="GNN", n_trials=20,
        architecture="gin",
        optimize_lr=True,
        optimize_dropout=True,
    ),
    project="chembl-hpo",
)
print("Best params:", hpo.best_params)

# 3. Final training with the best params
final_cfg = GnnTrainConfig(
    architecture="gin",
    lr=float(hpo.best_params.get("lr", 1e-3)),
    dropout=float(hpo.best_params.get("dropout", 0.1)),
    epochs=int(hpo.best_params.get("epochs", 30)),
)
final = train_gnn_on_run(run_dir, config=final_cfg)
print("Test:", final.metrics["splits"]["test"])
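When many parameters are swept, copying each one into the training config by hand gets tedious. One possible shortcut, sketched here with a hypothetical `ToyTrainConfig` standing in for `GnnTrainConfig` (whose full field list is not shown on this page), is to keep only the keys the config dataclass defines and let its defaults cover the rest.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyTrainConfig:
    """Hypothetical stand-in for GnnTrainConfig with three fields."""
    lr: float = 1e-3
    dropout: float = 0.1
    epochs: int = 30


def config_from_best_params(config_cls, best_params):
    """Build config_cls from best_params, ignoring keys it does not define."""
    known = {f.name for f in fields(config_cls)}
    return config_cls(**{k: v for k, v in best_params.items() if k in known})


cfg = config_from_best_params(
    ToyTrainConfig,
    {"lr": 3e-4, "dropout": 0.25, "pooling": "mean"},
)
print(cfg)
```

Unknown keys such as `pooling` are dropped silently in this sketch, so it is worth comparing the resulting config against `best_params` before the final training run.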