Dataset Splitting¶

Proper dataset splitting is critical for valid model evaluation in drug discovery. DTA-GNN provides several splitting strategies to prevent data leakage and ensure realistic performance estimates.

Why Splitting Matters¶

Data Leakage

Random splitting in drug discovery often leads to scaffold leakage: molecules with the same chemical scaffold appear in both train and test sets, causing inflated performance metrics.

The choice of splitting strategy should match your deployment scenario:

Scenario	Recommended Split
Predict activity for new scaffolds	Scaffold split
Prospective prediction of future compounds	Temporal split
General baseline	Random split

Available Strategies¶

Random Split¶

Random assignment of samples to train/val/test sets.

from dta_gnn.splits import split_random

df_split, train, val, test = split_random(
    df,
    test_size=0.2,
    val_size=0.1,
    seed=42
)

Characteristics:

✅ Simple and fast
✅ Good for general ML benchmarks
❌ Causes scaffold leakage
❌ Overly optimistic performance estimates

Use when:

Establishing a baseline
Dataset doesn't have scaffold structure
Computational resources are limited

Scaffold Split (Cold-Drug)¶

Splits based on Murcko scaffolds, ensuring molecules with the same core structure stay in the same partition.

from dta_gnn.splits import split_cold_drug_scaffold

df_split = split_cold_drug_scaffold(
    df,
    smiles_col="smiles",
    test_size=0.2,
    val_size=0.1,
    seed=42
)

Characteristics:

✅ Prevents scaffold leakage
✅ Simulates real drug discovery scenario
✅ More realistic performance estimates
❌ Uneven split sizes possible

Use when:

Evaluating generalization to new chemical series
Drug discovery applications
Publication-quality results

How Scaffold Split Works¶

graph TD
    A[All Molecules] --> B[Extract Murcko Scaffolds]
    B --> C[Group by Scaffold]
    C --> D[Shuffle Scaffold Groups]
    D --> E[Assign to Train until 70%]
    D --> F[Assign to Val until 80%]
    D --> G[Assign rest to Test]

Extract Murcko scaffold for each molecule
Group molecules by scaffold
Shuffle scaffold groups
Assign entire groups to splits (keeping scaffolds together)

Temporal Split¶

Splits based on publication year to simulate prospective prediction.

from dta_gnn.splits import split_temporal

df_split = split_temporal(
    df,
    year_col="year",
    split_year=2022,
    val_size=0.1
)

Characteristics:

✅ Simulates real-world deployment
✅ Tests temporal generalization
✅ Most realistic evaluation
❌ Requires year information
❌ Unbalanced splits if data is concentrated

Use when:

Simulating prospective prediction
Evaluating model stability over time
Regulatory/validation purposes

Configuration via Pipeline¶

The Pipeline class handles splitting automatically:

from dta_gnn.pipeline import Pipeline

pipeline = Pipeline(source_type="sqlite", sqlite_path="chembl_36.db")

df = pipeline.build_dta(
    target_ids=["CHEMBL204", "CHEMBL205"],
    split_method="scaffold",  # Options: random, scaffold, temporal
    test_size=0.2,
    val_size=0.1,
    split_year=2022  # Only for temporal split
)

Split Distribution¶

After splitting, verify the distribution:

print(df["split"].value_counts())
# train    7000
# val      1000
# test     2000

# For scaffold split, also check scaffold overlap
from dta_gnn.audits import audit_scaffold_leakage

train = df[df["split"] == "train"]
test = df[df["split"] == "test"]
audit = audit_scaffold_leakage(train, test)
print(f"Scaffold leakage ratio: {audit['leakage_ratio']:.2%}")

Comparison of Strategies¶

Strategy	Leakage Prevention	Realism	Implementation
Random	None	Low	Simple
Scaffold	Scaffold-level	Medium-High	RDKit required
Temporal	Time-based	Very High	Year data required

Best Practices¶

Splitting Guidelines

Never use random splits for drug discovery - Use scaffold split
Report split strategy clearly - Essential for reproducibility
Check for leakage after splitting - Use the audit functions
Use same seed for reproducibility - Default is 42
Consider multiple splits - For more robust evaluation

CLI Usage¶

from dta_gnn.pipeline import Pipeline

pipeline = Pipeline(source_type="sqlite", sqlite_path="./chembl_dbs/chembl_36.db")

# Scaffold split
df = pipeline.build_dta(target_ids=["CHEMBL204"], split_method="scaffold", output_path="dataset.csv")

# Temporal split
df = pipeline.build_dta(target_ids=["CHEMBL204"], split_method="temporal", split_year=2022, output_path="dataset.csv")

Troubleshooting¶

Uneven split sizes¶

Scaffold split may produce uneven splits due to large scaffold groups:

# Check scaffold distribution
from rdkit.Chem.Scaffolds import MurckoScaffold

scaffolds = df["smiles"].apply(
    lambda s: MurckoScaffold.MurckoScaffoldSmiles(s) if pd.notna(s) else None
)
print(scaffolds.value_counts().head(10))

Missing year column for temporal split¶

Ensure your data includes publication year:

# Check year availability
print(df["year"].notna().sum() / len(df))
# Should be close to 1.0

Very small test set¶

If test set is too small after splitting:

Use more targets
Adjust split ratios
Consider a different splitting strategy