Dataset Splitting¶
Proper dataset splitting is critical for valid model evaluation in drug discovery. DTA-GNN provides several splitting strategies to prevent data leakage and ensure realistic performance estimates.
Why Splitting Matters¶
Data Leakage
Random splitting in drug discovery often leads to scaffold leakage: molecules with the same chemical scaffold appear in both train and test sets, causing inflated performance metrics.
The choice of splitting strategy should match your deployment scenario:
| Scenario | Recommended Split |
|---|---|
| Predict activity for new scaffolds | Scaffold split |
| Prospective prediction of future compounds | Temporal split |
| General baseline | Random split |
Available Strategies¶
Random Split¶
Assigns samples to the train/val/test sets uniformly at random, without regard to chemical structure.
from dta_gnn.splits import split_random
df_split, train, val, test = split_random(
df,
test_size=0.2,
val_size=0.1,
seed=42
)
Characteristics:
- ✅ Simple and fast
- ✅ Good for general ML benchmarks
- ❌ Can cause scaffold leakage (the same scaffold in both train and test)
- ❌ Tends to give overly optimistic performance estimates
Use when:
- Establishing a baseline
- Dataset doesn't have scaffold structure
- Computational resources are limited
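To make the role of `seed` concrete, here is a minimal standard-library sketch of a seeded random split. It is an illustration only and not the implementation of `split_random`; the helper name `random_split_labels` and the rounding of partition sizes are assumptions.

```python
import random


def random_split_labels(n_samples, test_size=0.2, val_size=0.1, seed=42):
    """Return one 'train'/'val'/'test' label per sample (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG: the same seed gives the same shuffle
    order = list(range(n_samples))
    rng.shuffle(order)

    n_test = round(n_samples * test_size)
    n_val = round(n_samples * val_size)

    labels = ["train"] * n_samples
    for i in order[:n_test]:
        labels[i] = "test"
    for i in order[n_test:n_test + n_val]:
        labels[i] = "val"
    return labels


labels_a = random_split_labels(1000, seed=42)
labels_b = random_split_labels(1000, seed=42)  # identical to labels_a
labels_c = random_split_labels(1000, seed=7)   # different assignment
```

Because nothing in this procedure looks at the molecules themselves, two samples sharing a scaffold can land in different partitions, which is the leakage described above.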
Scaffold Split (Cold-Drug)¶
Splits based on Murcko scaffolds, ensuring molecules with the same core structure stay in the same partition.
from dta_gnn.splits import split_cold_drug_scaffold
df_split = split_cold_drug_scaffold(
df,
smiles_col="smiles",
test_size=0.2,
val_size=0.1,
seed=42
)
Characteristics:
- ✅ Prevents scaffold leakage
- ✅ Simulates real drug discovery scenario
- ✅ More realistic performance estimates
- ❌ Split sizes can deviate from the requested ratios, because whole scaffold groups are assigned together
Use when:
- Evaluating generalization to new chemical series
- Drug discovery applications
- Publication-quality results
How Scaffold Split Works¶
graph TD
A[All Molecules] --> B[Extract Murcko Scaffolds]
B --> C[Group by Scaffold]
C --> D[Shuffle Scaffold Groups]
D --> E[Assign to Train until 70%]
D --> F[Assign to Val until 80%]
D --> G[Assign rest to Test]
- Extract Murcko scaffold for each molecule
- Group molecules by scaffold
- Shuffle scaffold groups
- Assign entire groups to splits (keeping scaffolds together)
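The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code behind `split_cold_drug_scaffold`: the scaffold strings are assumed to be precomputed (for example with RDKit), and the helper name `group_split` and the exact cutoff rule are hypothetical.

```python
import random
from collections import defaultdict


def group_split(scaffolds, test_size=0.2, val_size=0.1, seed=42):
    """Label each sample so that all members of a scaffold share one split."""
    # Group sample indices by scaffold key
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Sort first so the shuffle is reproducible, then shuffle the groups
    group_list = sorted(groups.items())
    random.Random(seed).shuffle(group_list)

    n = len(scaffolds)
    train_cutoff = (1.0 - test_size - val_size) * n  # e.g. 70% of samples
    val_cutoff = (1.0 - test_size) * n               # e.g. 80% cumulative

    labels = [None] * n
    assigned = 0
    for _, members in group_list:
        if assigned < train_cutoff:
            split = "train"
        elif assigned < val_cutoff:
            split = "val"
        else:
            split = "test"
        for idx in members:  # the whole group goes to one split
            labels[idx] = split
        assigned += len(members)
    return labels


# Scaffold SMILES used here only as opaque grouping keys
scaffolds = (
    ["c1ccccc1"] * 5
    + ["c1ccncc1"] * 3
    + ["C1CCCCC1"] * 2
    + ["c1ccc2ccccc2c1", "c1ccoc1"]
)
labels = group_split(scaffolds)
```

Since groups are never broken up, the realized partition sizes depend on the group sizes and the shuffle order, so they only approximate the requested ratios.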
Temporal Split¶
Splits based on publication year to simulate prospective prediction.
from dta_gnn.splits import split_temporal
df_split = split_temporal(
df,
year_col="year",
split_year=2022,
val_size=0.1
)
Characteristics:
- ✅ Simulates real-world deployment
- ✅ Tests temporal generalization
- ✅ Most realistic evaluation
- ❌ Requires year information
- ❌ Unbalanced splits if the data is concentrated in a few years
Use when:
- Simulating prospective prediction
- Evaluating model stability over time
- Regulatory/validation purposes
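As a rough illustration of the idea, the sketch below labels records by year using only the standard library. The cutoff rule (records from `split_year` onward go to test) and the way validation is carved from the most recent earlier records are assumptions made for this example; `split_temporal` may behave differently, so check its documentation before relying on either detail.

```python
def temporal_split_labels(years, split_year, val_size=0.1):
    """Label each record 'train', 'val' or 'test' by publication year."""
    # Assumption: everything published in split_year or later is held out
    labels = ["test" if y >= split_year else "train" for y in years]

    # Assumption: validation takes the most recent pre-cutoff records
    older = sorted(
        (i for i, y in enumerate(years) if y < split_year),
        key=lambda i: (years[i], i),
    )
    n_val = round(len(older) * val_size)
    for i in older[len(older) - n_val:]:
        labels[i] = "val"
    return labels


years = [2015] * 4 + [2018] * 3 + [2020] * 3 + [2022] * 2 + [2023] * 2
labels = temporal_split_labels(years, split_year=2022, val_size=0.1)
```

With this rule the model never sees a record from the held-out period during training, which is what makes the evaluation prospective.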
Configuration via Pipeline¶
The Pipeline class handles splitting automatically:
from dta_gnn.pipeline import Pipeline
pipeline = Pipeline(source_type="sqlite", sqlite_path="chembl_36.db")
df = pipeline.build_dta(
target_ids=["CHEMBL204", "CHEMBL205"],
split_method="scaffold", # Options: random, scaffold, temporal
test_size=0.2,
val_size=0.1,
split_year=2022 # Only for temporal split
)
Split Distribution¶
After splitting, verify the distribution:
print(df["split"].value_counts())
# train 7000
# val 1000
# test 2000
# For scaffold split, also check scaffold overlap
from dta_gnn.audits import audit_scaffold_leakage
train = df[df["split"] == "train"]
test = df[df["split"] == "test"]
audit = audit_scaffold_leakage(train, test)
print(f"Scaffold leakage ratio: {audit['leakage_ratio']:.2%}")
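For intuition about what such an audit measures, here is a small standard-library sketch of one possible overlap metric: the fraction of distinct test scaffolds that also occur in train. This definition and the helper name `scaffold_overlap_ratio` are assumptions; the `leakage_ratio` reported by `audit_scaffold_leakage` may be defined differently.

```python
def scaffold_overlap_ratio(train_scaffolds, test_scaffolds):
    """Fraction of distinct test scaffolds that also appear in train."""
    train_set = set(train_scaffolds)
    test_set = set(test_scaffolds)
    if not test_set:
        return 0.0  # nothing to leak into an empty test set
    return len(train_set & test_set) / len(test_set)


# Scaffold SMILES used as opaque keys
train_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
test_scaffolds = ["C1CCCCC1", "c1ccoc1"]

ratio = scaffold_overlap_ratio(train_scaffolds, test_scaffolds)
print(f"Scaffold overlap: {ratio:.2%}")
```

A correct scaffold split should give a ratio of zero under this definition, while a random split on a typical dataset usually will not.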
Comparison of Strategies¶
| Strategy | Leakage Prevention | Realism | Implementation |
|---|---|---|---|
| Random | None | Low | Simple |
| Scaffold | Scaffold-level | Medium-High | RDKit required |
| Temporal | Time-based | Very High | Year data required |
Best Practices¶
Splitting Guidelines
- Do not rely on random splits as the main evaluation for drug discovery - Use a scaffold or temporal split, and keep random splits as a baseline only
- Report split strategy clearly - Essential for reproducibility
- Check for leakage after splitting - Use the audit functions
- Use the same seed for reproducibility - Default is 42
- Consider multiple splits - For more robust evaluation
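One way to act on the last guideline is to repeat the split-train-evaluate cycle over several seeds and report the spread. The sketch below assumes a user-supplied `evaluate(seed)` callable that performs the split and returns a single score; that callable, the helper name `evaluate_over_seeds`, and the dummy numbers are all hypothetical stand-ins, not DTA-GNN APIs or real results.

```python
import statistics


def evaluate_over_seeds(evaluate, seeds):
    """Run evaluate(seed) for each seed and summarize the scores."""
    scores = [evaluate(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


def dummy_evaluate(seed):
    # Stand-in for "split with this seed, train, and score on test".
    # Returns a synthetic number purely so the example runs.
    return 0.70 + 0.01 * seed


summary = evaluate_over_seeds(dummy_evaluate, seeds=[0, 1, 2])
print(f"{summary['mean']:.3f} +/- {summary['std']:.3f}")
```

Reporting the mean together with the standard deviation makes it easier to tell whether a difference between two models exceeds the variation caused by the split itself.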
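CLI Usage¶
The examples below use the Python API to build a split dataset and write it to a CSV file via `output_path`: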
from dta_gnn.pipeline import Pipeline
pipeline = Pipeline(source_type="sqlite", sqlite_path="./chembl_dbs/chembl_36.db")
# Scaffold split (written to its own file)
df = pipeline.build_dta(target_ids=["CHEMBL204"], split_method="scaffold", output_path="dataset_scaffold.csv")
# Temporal split (separate file so the scaffold output is not overwritten)
df = pipeline.build_dta(target_ids=["CHEMBL204"], split_method="temporal", split_year=2022, output_path="dataset_temporal.csv")
Troubleshooting¶
Uneven split sizes¶
Scaffold split may produce uneven splits due to large scaffold groups:
# Check scaffold distribution
import pandas as pd
from rdkit.Chem.Scaffolds import MurckoScaffold
scaffolds = df["smiles"].apply(
lambda s: MurckoScaffold.MurckoScaffoldSmiles(s) if pd.notna(s) else None
)
print(scaffolds.value_counts().head(10))
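To see whether a few groups are large enough to distort the split, it can help to express group sizes as fractions of the dataset and compare them with the requested partition size. The following standard-library sketch assumes the scaffold strings have already been computed as above; the helper name `oversized_groups` and the threshold are hypothetical.

```python
from collections import Counter


def oversized_groups(scaffolds, max_fraction=0.2):
    """Return (scaffold, fraction) pairs for groups larger than max_fraction."""
    counts = Counter(scaffolds)
    total = len(scaffolds)
    return [
        (scaffold, count / total)
        for scaffold, count in counts.most_common()
        if count / total > max_fraction
    ]


# Scaffold SMILES used as opaque keys
scaffolds = ["c1ccccc1"] * 6 + ["c1ccncc1"] * 3 + ["C1CCCCC1"]
flagged = oversized_groups(scaffolds, max_fraction=0.2)
for scaffold, fraction in flagged:
    print(f"{scaffold}: {fraction:.0%} of the dataset")
```

Any group larger than `test_size` cannot fit into the test partition without overshooting it, so whichever partition receives that group will end up larger than requested.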
Missing year column for temporal split¶
Ensure your data includes a publication year column, and pass its name as `year_col` (for example, `year_col="year"`).
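A quick pre-check can catch missing years before the split is attempted. The sketch below works on a list of row dictionaries using only the standard library; the helper name `rows_missing_year` is hypothetical. With a pandas DataFrame, `df["year"].isna().sum()` gives the count of missing values directly.

```python
import math


def rows_missing_year(records, year_col="year"):
    """Return indices of rows whose year is absent, None, or NaN."""
    missing = []
    for i, row in enumerate(records):
        value = row.get(year_col)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing.append(i)
    return missing


records = [
    {"smiles": "CCO", "year": 2019},
    {"smiles": "c1ccccc1", "year": None},
    {"smiles": "CCN", "year": float("nan")},
    {"smiles": "CCC"},  # no year key at all
]
missing = rows_missing_year(records)
print(f"{len(missing)} of {len(records)} rows have no usable year")
```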
Very small test set¶
If the test set is too small after splitting:
- Use more targets
- Adjust split ratios
- Consider a different splitting strategy
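Before changing anything, it may help to quantify how far the realized split is from the requested ratios. This standard-library sketch computes per-partition fractions from a list of split labels; the helper name `split_fractions` and the `min_test_fraction` threshold are hypothetical and should be set according to your own requirements.

```python
from collections import Counter


def split_fractions(labels):
    """Return the fraction of samples in each of train/val/test."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {"train": 0.0, "val": 0.0, "test": 0.0}
    return {name: counts[name] / total for name in ("train", "val", "test")}


# Same counts as the value_counts() example earlier on this page
labels = ["train"] * 7000 + ["val"] * 1000 + ["test"] * 2000
fractions = split_fractions(labels)

min_test_fraction = 0.15  # hypothetical threshold
test_too_small = fractions["test"] < min_test_fraction
print(fractions, "test too small:", test_too_small)
```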