Visualization¶
DTA-GNN provides visualization functions to explore datasets, analyze activity distributions, and visualize chemical space.
Available Functions¶
| Function | Purpose | Output |
|---|---|---|
| `plot_activity_distribution` | Histogram of pChEMBL values | Bar chart |
| `plot_split_sizes` | Train/val/test distribution | Bar chart |
| `plot_chemical_space` | 2D projection of molecular diversity | Scatter plot |
Activity Distribution¶
Visualize the distribution of binding affinity values (pChEMBL) in your dataset.
from dta_gnn.visualization import plot_activity_distribution
import matplotlib.pyplot as plt
fig = plot_activity_distribution(
    df,
    title="Binding Affinity Distribution"
)
plt.savefig("activity_dist.png")
plt.show()
Features:
- Bins pChEMBL values into 0.5-unit intervals
- Automatically handles missing values
- Customizable title
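To make the 0.5-unit binning concrete, here is a minimal sketch of how such a histogram could be computed by hand. It is not the library's implementation: `bin_pchembl` is a hypothetical helper, and treating `None` and NaN as the missing values to skip is an assumption.

```python
import math
from collections import Counter


def bin_pchembl(values, width=0.5):
    """Count values per bin; each bin is keyed by its left edge (hypothetical helper)."""
    counts = Counter()
    for value in values:
        # Skip missing entries (None or NaN) instead of failing on them.
        if value is None or math.isnan(value):
            continue
        left_edge = math.floor(value / width) * width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


# 5.1 and 5.4 share the [5.0, 5.5) bin; 5.5 starts the next bin.
histogram = bin_pchembl([5.1, 5.4, 5.5, None, float("nan"), 7.0])
```

Comparing these counts with the plotted bars is a quick way to confirm that no values were silently dropped.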
Split Sizes¶
Visualize the distribution of samples across train/validation/test splits.
from dta_gnn.visualization import plot_split_sizes
import matplotlib.pyplot as plt
fig = plot_split_sizes(df)
plt.savefig("split_sizes.png")
plt.show()
Features:
- Shows count for each split
- Color-coded bars
- Automatic labeling
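The numbers behind the bar chart are easy to reproduce, which helps when a split looks unexpectedly small. A minimal sketch, assuming split labels are plain strings such as "train", "val", and "test" (as in the `split` column used elsewhere on this page); `split_summary` is a hypothetical helper, not part of `dta_gnn`.

```python
from collections import Counter


def split_summary(split_labels):
    """Return {split: (count, fraction)} for a sequence of split labels."""
    counts = Counter(split_labels)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.items()}


# With a DataFrame this would be called as split_summary(df["split"].tolist()).
summary = split_summary(["train"] * 8 + ["val"] + ["test"])
```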
Chemical Space Visualization¶
Project molecular structures into 2D space using dimensionality reduction techniques.
from dta_gnn.visualization import plot_chemical_space
import matplotlib.pyplot as plt
# Option 1: Dictionary with groups
smiles_dict = {
    "train": train_df["smiles"].tolist(),
    "test": test_df["smiles"].tolist()
}
fig = plot_chemical_space(
    smiles_dict,
    method="t-SNE",   # or "PCA"
    radius=2,         # Morgan FP radius
    n_bits=1024,      # Fingerprint size
    perplexity=30,    # t-SNE perplexity
    random_state=42
)
plt.savefig("chemical_space.png")
plt.show()
# Option 2: Flat list
smiles_list = df["smiles"].tolist()
fig = plot_chemical_space(smiles_list, method="PCA")
Parameters¶
| Parameter | Default | Description |
|---|---|---|
| `method` | `"t-SNE"` | Dimensionality reduction: `"t-SNE"` or `"PCA"` |
| `radius` | `2` | Morgan fingerprint radius |
| `n_bits` | `1024` | Fingerprint bit size |
| `perplexity` | `30` | t-SNE perplexity (5-50 recommended) |
| `learning_rate` | `200.0` | t-SNE learning rate |
| `random_state` | `42` | Random seed for reproducibility |
When to Use Each Method¶
t-SNE:
- ✅ Better for visualizing clusters
- ✅ Non-linear relationships
- ❌ Slower for large datasets (>10k molecules)
- ❌ Perplexity tuning required

PCA:
- ✅ Fast and deterministic
- ✅ Good for large datasets
- ✅ Preserves global structure
- ❌ Linear projections only
Example: Comparing Splits¶
from dta_gnn.visualization import plot_chemical_space
# Group by split
train_smiles = df[df["split"] == "train"]["smiles"].tolist()
val_smiles = df[df["split"] == "val"]["smiles"].tolist()
test_smiles = df[df["split"] == "test"]["smiles"].tolist()
smiles_by_split = {
    "Train": train_smiles,
    "Validation": val_smiles,
    "Test": test_smiles
}
fig = plot_chemical_space(
    smiles_by_split,
    method="t-SNE",
    perplexity=30
)
This plot helps verify that:
- With a scaffold split, the test set occupies a different region of chemical space than the training set
- With a random split, points do not cluster by split
- Chemical diversity is maintained across splits
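A visual check can be paired with a simple numeric one. The sketch below looks for SMILES strings that appear in more than one split. `shared_smiles` is a hypothetical helper, and it compares raw strings only: two different SMILES spellings of the same molecule will not match unless the strings were canonicalized beforehand.

```python
from itertools import combinations


def shared_smiles(smiles_by_split):
    """Return {(split_a, split_b): set of SMILES strings present in both splits}."""
    overlaps = {}
    for (name_a, smiles_a), (name_b, smiles_b) in combinations(smiles_by_split.items(), 2):
        common = set(smiles_a) & set(smiles_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Toy data: "CCO" was deliberately placed in both Train and Test.
example = {
    "Train": ["CCO", "c1ccccc1", "CC(=O)O"],
    "Validation": ["CCN"],
    "Test": ["CCO", "CCCC"],
}
leaks = shared_smiles(example)
```

An empty result means no identical strings are shared. It does not rule out near-duplicates or shared scaffolds, which is where the chemical space plot is more informative.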
Top-K Visualization¶
When visualizing model predictions, you can filter to show only the top-K highest binding affinity predictions from the test set. This is useful for identifying the most promising compounds.
Usage in UI¶
- Go to the Visualization tab
- Select "Model Predictions" as the color scheme
- Select a trained model
- Enable "Show Top-K Test Predictions"
- Set K value (10-1000, default: 100)
- Click "Generate Visualization"
The plot then shows only the K test-set molecules with the highest predicted binding affinity.
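The same filtering can be approximated outside the UI. This is a minimal sketch rather than the UI's actual code: it assumes predictions are available as `(smiles, predicted_pchembl)` pairs and that a higher value means stronger predicted binding. `top_k_predictions` is a hypothetical helper.

```python
import heapq


def top_k_predictions(predictions, k=100):
    """Return the k (smiles, score) pairs with the highest predicted score, best first."""
    return heapq.nlargest(k, predictions, key=lambda pair: pair[1])


test_predictions = [("CCO", 5.2), ("c1ccccc1", 7.9), ("CCN", 6.4), ("CCCC", 4.8)]
top_two = top_k_predictions(test_predictions, k=2)
```

`heapq.nlargest` avoids sorting the full list, which matters little at this scale but keeps the helper cheap for large test sets. If `k` exceeds the number of predictions, all of them are returned.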
Use Cases¶
- Hit Identification: Focus on the most promising compounds for further analysis
- Model Validation: Verify that top predictions align with chemical intuition
- Scaffold Analysis: Identify common scaffolds among top predictions
- Diversity Analysis: Check if top-K compounds are diverse or clustered
Example Workflow¶
# 1. Train a model and generate predictions
from dta_gnn.models import train_gnn_on_run, GnnTrainConfig
config = GnnTrainConfig(architecture="gin", epochs=50)
result = train_gnn_on_run("runs/current", config=config)
# 2. Extract embeddings
from dta_gnn.models import extract_gnn_embeddings_on_run
extract_gnn_embeddings_on_run("runs/current")
# 3. In UI: Visualize with top-K filtering
# - Select "Model Predictions" color scheme
# - Enable "Show Top-K Test Predictions"
# - Set K=50 to see top 50 predictions
# - Analyze the chemical space of top compounds
Interpreting Top-K Results¶
- Clustering: If top-K compounds cluster together, they may share similar scaffolds
- Diversity: If top-K compounds are spread out, the model identifies diverse promising compounds
- Ground Truth Alignment: Compare top-K predictions with actual high-affinity compounds in the dataset
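Ground truth alignment can also be expressed as a single number: the fraction of the top-K predicted compounds that are also among the top-K by measured affinity. The sketch below assumes both predicted and measured values are held in dictionaries keyed by a compound identifier; `top_k_overlap` is a hypothetical helper, not a `dta_gnn` function.

```python
def top_k_overlap(predicted, measured, k=100):
    """Fraction of the top-k predicted compounds that are also top-k by measurement."""

    def top_ids(scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:k])

    top_predicted = top_ids(predicted)
    if not top_predicted:
        return 0.0
    return len(top_predicted & top_ids(measured)) / len(top_predicted)


predicted = {"a": 8.1, "b": 7.5, "c": 6.0, "d": 5.2}
measured = {"a": 7.9, "b": 5.0, "c": 7.2, "d": 4.9}
# Top-2 predicted: a, b. Top-2 measured: a, c. One of two agrees.
overlap = top_k_overlap(predicted, measured, k=2)
```

A value near 1.0 suggests the model ranks the strongest binders well; a low value is a cue to inspect the top-K compounds before relying on them.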
Complete Visualization Workflow¶
import matplotlib.pyplot as plt
from dta_gnn.visualization import (
    plot_activity_distribution,
    plot_split_sizes,
    plot_chemical_space
)
# 1. Activity distribution
fig1 = plot_activity_distribution(df, title="Dataset Activity Distribution")
plt.savefig("activity_dist.png", dpi=150, bbox_inches="tight")
plt.close(fig1)
# 2. Split sizes
fig2 = plot_split_sizes(df)
plt.savefig("split_sizes.png", dpi=150, bbox_inches="tight")
plt.close(fig2)
# 3. Chemical space
smiles_dict = {
    "Train": df[df["split"] == "train"]["smiles"].tolist(),
    "Test": df[df["split"] == "test"]["smiles"].tolist()
}
fig3 = plot_chemical_space(smiles_dict, method="t-SNE")
plt.savefig("chemical_space.png", dpi=150, bbox_inches="tight")
plt.close(fig3)
Integration with UI¶
The web UI automatically generates these visualizations when building datasets:
- Activity Distribution: Shown in Dataset Builder tab
- Split Sizes: Displayed after dataset creation
- Chemical Space: Available in the visualization panel
Performance Tips¶
Large Datasets¶
For datasets with >10,000 molecules:
# Use PCA instead of t-SNE
fig = plot_chemical_space(
    smiles_list,
    method="PCA",
    n_bits=512  # Smaller fingerprint for speed
)
# Or sample before visualization
import random
sampled = random.sample(smiles_list, 5000)
fig = plot_chemical_space(sampled, method="t-SNE")
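Note that `random.sample` raises `ValueError` when the requested sample is larger than the list, and its result changes between runs unless a seed is set. A small wrapper handles both cases; `sample_smiles` and its default values are illustrative assumptions.

```python
import random


def sample_smiles(smiles_list, max_size=5000, seed=42):
    """Return at most max_size SMILES, drawn reproducibly without replacement."""
    if len(smiles_list) <= max_size:
        return list(smiles_list)
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(smiles_list, max_size)


# Toy input: 20 distinct alkane-like SMILES strings of increasing length.
molecules = ["C" * (i + 1) for i in range(20)]
subset = sample_smiles(molecules, max_size=5)
```

The returned list can be passed to `plot_chemical_space` in place of the full list.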
Memory Optimization¶
# Process in batches for very large datasets
import matplotlib.pyplot as plt
from dta_gnn.visualization import plot_activity_distribution

def visualize_large_dataset(df, batch_size=5000):
    all_figs = []
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i:i + batch_size]
        fig = plot_activity_distribution(batch)
        all_figs.append(fig)
        plt.close(fig)  # Release pyplot's reference to the figure
    return all_figs
Troubleshooting¶
Empty plots¶
If plots appear empty:
# Check for missing data
print(df["pchembl_value"].notna().sum())
print(df["split"].value_counts())
# Filter before plotting
df_clean = df.dropna(subset=["pchembl_value", "split"])
fig = plot_activity_distribution(df_clean)
t-SNE errors¶
If t-SNE fails:
# Reduce perplexity for small datasets
n_samples = len(smiles_list)
perplexity = min(30, max(5, n_samples // 4))
fig = plot_chemical_space(
    smiles_list,
    method="t-SNE",
    perplexity=perplexity
)
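One edge case is worth guarding against. scikit-learn's `TSNE` requires the perplexity to be smaller than the number of samples, and the expression above still yields 5 when there are five or fewer molecules. The sketch below adds that cap. It assumes `plot_chemical_space` passes `perplexity` through to scikit-learn, which this page does not state; `safe_perplexity` is a hypothetical helper.

```python
def safe_perplexity(n_samples, preferred=30.0, minimum=5.0):
    """Pick a t-SNE perplexity that stays strictly below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two molecules")
    candidate = min(preferred, max(minimum, n_samples // 4))
    # Cap the value so it is always smaller than the number of samples.
    return float(min(candidate, n_samples - 1))


tiny = safe_perplexity(4)       # capped below the sample count
medium = safe_perplexity(100)   # n_samples // 4
large = safe_perplexity(1000)   # preferred value
```

For very small inputs, PCA is usually the simpler choice, since it has no perplexity to tune.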
Invalid SMILES¶
The visualization functions automatically skip invalid SMILES, but you can also check for them yourself before plotting.
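One way to run such a check is with RDKit, whose `Chem.MolFromSmiles` returns `None` for strings it cannot parse. This is a sketch, not the check the library performs internally: `find_invalid_smiles` and its `parse` argument are hypothetical, and treating empty or non-string entries as invalid is an assumption.

```python
def find_invalid_smiles(smiles_list, parse=None):
    """Return (index, value) pairs for entries that cannot be parsed as SMILES."""
    if parse is None:
        # Imported lazily so the function can also be exercised with a stub parser.
        from rdkit import Chem
        parse = Chem.MolFromSmiles

    invalid = []
    for index, smiles in enumerate(smiles_list):
        # Empty or non-string entries (e.g. NaN from pandas) are treated as invalid.
        if not isinstance(smiles, str) or not smiles.strip():
            invalid.append((index, smiles))
        elif parse(smiles) is None:  # the parser signals failure by returning None
            invalid.append((index, smiles))
    return invalid
```

With RDKit installed, `find_invalid_smiles(df["smiles"].tolist())` lists the offending rows so they can be corrected or dropped before plotting.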