Beyond the ladder
20 more notebooks for every step of your SAE journey.
Post-train discovery, one-click steering, model coverage, research replication, safety monitoring — every notebook opens directly in Colab or Kaggle.
Close the loop
You have an SAE. Now understand it, share it, edit it.
Discover your features
Auto-label your SAE with an LLM judge
You trained an SAE. Now what? This notebook streams activations, ranks features by interestingness, sends top-activating examples to Claude or GPT-4, and returns a feature_catalog.json with 1-sentence descriptions.
⏱ ~20 min · Colab T4
▸ Colab Free · ANTHROPIC_API_KEY or OPENAI_API_KEY
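A rough sketch of the pipeline's shape — the "interestingness" heuristic, feature counts, and catalog fields here are illustrative assumptions, and the real notebook sends the examples to an LLM judge for the descriptions:

```python
import json
import numpy as np

rng = np.random.default_rng(0)
acts = rng.gamma(1.0, 1.0, size=(1000, 8))       # token x feature activations (stand-in)
tokens = [f"tok{i}" for i in range(1000)]

# One possible "interestingness" heuristic: peak-to-mean activation ratio
score = acts.max(axis=0) / (acts.mean(axis=0) + 1e-8)
top_feats = np.argsort(score)[::-1][:3]

catalog = []
for feat in top_feats:
    top_idx = np.argsort(acts[:, feat])[::-1][:5]          # top-activating tokens
    catalog.append({
        "feature": int(feat),
        "examples": [tokens[i] for i in top_idx],          # what the LLM judge sees
        "description": "<one-sentence LLM label goes here>",
    })

print(json.dumps(catalog[0], indent=2))
```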
Build a shareable Trace
Your SAE + your prompt → trace.json + shareable URL
Generate a TraceData JSON (exact Trace Theater schema) for a custom prompt + SAE. Emits the same format /observatory/trace consumes. Upload to HF and share the URL.
⏱ ~5 min · Colab T4
▸ Colab Free
Steer your model
Live feature intervention — baseline vs α ∈ {-3, 0, 1, 3}
Pick a feature, slide its activation coefficient, regenerate. Shows causal effect side-by-side. Q1 preview of the Q2 Sandbox. Exports interventions.json for inclusion in Trace Theater counterfactuals.
⏱ ~3 min · Colab T4
▸ Colab Free
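The intervention arithmetic in miniature — the notebook hooks this into generation, but the core is just adding a scaled decoder direction to the residual stream (shapes and the unit-norm direction are stand-ins):

```python
import numpy as np

def steer(resid, feature_dir, alpha):
    """Add alpha x (unit-norm SAE decoder direction) to every residual position."""
    return resid + alpha * feature_dir

rng = np.random.default_rng(0)
resid = rng.normal(size=(8, 16))                 # (seq, d_model) stand-in
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)

for alpha in [-3, 0, 1, 3]:
    # projection onto the feature direction shifts by exactly alpha
    shift = (steer(resid, direction, alpha) - resid) @ direction
    print(f"alpha={alpha:>2}: projection shift {shift.mean():+.1f}")
```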
Reduce friction
Pick the right tier before you spin up a GPU.
Pick your tier
VRAM calculator + layer recommender
Interactive "what tier should I train?" advisor. Auto-detects your GPU, asks for time budget, recommends a notebook + model + layer. Zero GPU required.
⏱ < 1 min · CPU fine
▸ Anywhere
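The back-of-envelope arithmetic such an advisor runs on, as a hedged heuristic — the 1.2× overhead factor is an assumption for the sketch, not the notebook's calibrated model, which would also budget for the SAE, optimizer state, and batch:

```python
def vram_gb(params_b, bytes_per_param=2, overhead=1.2):
    """Rough fp16 weight-memory estimate in GB: params x 2 bytes x overhead.

    The 1.2x overhead for activations/buffers is a hand-wavy assumption.
    """
    return params_b * bytes_per_param * overhead

for name, size_b in [("Phi-3-mini", 3.8), ("Mistral-7B", 7.2), ("Llama-3.1-8B", 8.0)]:
    print(f"{name}: ~{vram_gb(size_b):.1f} GB for fp16 weights")
```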
More models
Same recipe, different architectures.
Llama-3.1-8B SAE
Tier 2 port — Llama-3.1-8B on Kaggle free
Train an SAE on the most popular open model. 100M tokens on a Kaggle 2× T4 in ~5–6 h, HF-resumable checkpoints, standard .model.layers path.
⏱ 5–6 h · Kaggle 2× T4
▸ Kaggle Free · Meta license acceptance required
Mistral-7B SAE
Tier 2 port — Mistral-7B-v0.3 on Kaggle free
Clean decoder; sliding-window attention is transparent to SAE training. Same Kaggle recipe as Llama, just swapping in the model. HF-resumable checkpoints.
⏱ 4–5 h · Kaggle 2× T4
▸ Kaggle Free
Phi-3-mini SAE
Tier 1 alt — even faster hobbyist path
Microsoft Phi-3-mini (3.8B) fits comfortably on Colab's free T4. ~20-min training, Drive checkpoints, first feature discovery gift-wrapped.
⏱ ~20 min · Colab T4
▸ Colab Free
Research-grade
Replicate published results. Write your paper.
Stage Gate G1 — correlation pre-test
ρ ≥ 0.30 or don't burn GPU hours on RL
Replicates the Stage Gate 1 protocol from mechreward. Computes Spearman ρ between your SAE feature pack and GSM8K correctness on 100 held-out samples. Pass/fail + scatter plot + report upload.
⏱ 20–30 min · Colab T4
▸ Colab · any tier
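The gate itself is a one-liner; a sketch with synthetic stand-ins for the feature-pack scores and the GSM8K correctness labels:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=100).astype(float)   # GSM8K pass/fail stand-in
feature = 1.5 * correct + rng.normal(size=100)         # feature-pack score stand-in

# Stage Gate G1: Spearman rank correlation against the pass threshold
rho, p = spearmanr(feature, correct)
verdict = "PASS" if rho >= 0.30 else "FAIL"
print(f"Spearman rho = {rho:.2f} (p = {p:.1e}) -> {verdict}")
```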
BatchTopK vs TopK
Replicate arXiv:2412.06410
Train TopK and BatchTopK on identical activation batches and compare their Pareto frontiers (var_exp, L0, dead%). Shows where BatchTopK dominates and by how much.
⏱ ~45 min · Colab T4
▸ Colab Free
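The difference between the two activation functions, sketched in NumPy — shapes and k are illustrative, and in real training these sit inside the SAE forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(64, 512)), 0)     # batch x features, post-ReLU

def topk(acts, k):
    """Per-sample TopK: keep the k largest activations in every row."""
    out = np.zeros_like(acts)
    idx = np.argsort(acts, axis=1)[:, -k:]
    rows = np.arange(acts.shape[0])[:, None]
    out[rows, idx] = acts[rows, idx]
    return out

def batch_topk(acts, k):
    """BatchTopK: keep the k * batch_size largest activations globally, so
    per-sample sparsity can vary while the batch-mean L0 stays exactly k."""
    flat = acts.ravel()
    keep = np.argsort(flat)[-k * acts.shape[0]:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

k = 32
per_row = (batch_topk(acts, k) != 0).sum(axis=1)
print(f"TopK L0: always {k}; BatchTopK L0 range: {per_row.min()}-{per_row.max()}")
```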
Circuits
Attribution graphs between SAE features. View with /observatory/circuits.
Attribution Patching (AtP*)
Kramár 2024 — QK-fix + GradDrop · node attribution
Compute per-feature attribution scores on your SAE using AtP* (the 2-forward-1-backward linearization). Mean-ablation baseline, QK-fix for attention heads, GradDrop for sign-cancellation robustness. Emits circuit JSON for the Canvas viewer.
⏱ ~15 min · Colab T4
▸ Colab Free
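The core linearization, stripped of the QK-fix and GradDrop refinements, on synthetic stand-in tensors (a real run gets the activations and gradient from two forwards and one backward):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats = 100

# Stand-ins: SAE feature activations on the clean prompt, their mean-ablation
# baseline, and the gradient of the metric w.r.t. those activations
clean = rng.normal(size=n_feats)
ablated = np.full(n_feats, clean.mean())
grad = rng.normal(size=n_feats)

# AtP's first-order estimate of each feature's effect on the metric:
# delta_metric ~= (ablated - clean) * d(metric)/d(activation)
attribution = (ablated - clean) * grad
top = np.argsort(np.abs(attribution))[::-1][:5]
print("top attributed features:", top.tolist())
```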
Sparse Feature Circuits (Marks 2024)
arXiv:2403.19647 replication · node + edge DAG
Full replication of Marks et al. 2024. Node attribution via AtP + IG-10 fallback for early layers. Edge attribution via Appendix A.1 (upstream decoder × downstream encoder × upstream delta × downstream gradient). SAE error terms as triangle nodes.
⏱ ~20 min · A100
▸ Colab Pro A100
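The edge formula the card cites, sketched on random stand-in weights — one score per (upstream, downstream) feature pair; error-term triangle nodes and IG fallbacks are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_up, n_down = 32, 4, 3

W_dec_up = rng.normal(size=(n_up, d))       # upstream decoder rows (stand-ins)
W_enc_down = rng.normal(size=(d, n_down))   # downstream encoder columns
delta_up = rng.normal(size=n_up)            # upstream activation deltas
grad_down = rng.normal(size=n_down)         # d(metric)/d(downstream activation)

# Linear edge-attribution estimate:
# edge(u, v) = (dec_u . enc_v) * delta_u * grad_v
edges = (W_dec_up @ W_enc_down) * np.outer(delta_up, grad_down)
print(edges.shape)   # one score per (upstream, downstream) feature pair
```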
ACDC slow-mode via AutoCircuit
NeurIPS 2023 algorithm · independent verification
Run the original ACDC algorithm (Conmy 2023) using AutoCircuit (UFO-101 — the practitioner-default fork). Slower than AtP but peer-reviewed. Compare faithfulness curves across methods. Emits circuit.json compatible with the Canvas viewer.
⏱ 1–2 h · Colab T4
▸ Colab · any tier
Train a Sparse Crosscoder
Lindsey 2024 · shared dictionary across 3+ layers
Train a single crosscoder that reads and writes across multiple residual layers simultaneously. Unifies L11/L31/L55-style multi-layer SAEs into one feature index. Greenfield — not yet in SAELens. Classifies features as persistent / early-only / late-only / mixed.
⏱ ~30 min · T4 (20M tok) · scales to paper-grade
▸ Colab Free · T4
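A minimal forward pass showing the "one dictionary, many layers" idea — shapes, the ReLU activation, and the random weights are all assumptions for the sketch, not the notebook's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_feats, batch = 64, 3, 256, 4

# Per-layer encoder/decoder weights sharing one feature dictionary (stand-ins)
W_enc = rng.normal(size=(n_layers, d_model, n_feats)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_layers, n_feats, d_model)) / np.sqrt(n_feats)

def crosscoder(resids):
    """One shared feature vector reads from, then writes to, every layer."""
    f = np.maximum(sum(r @ W_enc[l] for l, r in enumerate(resids)), 0)  # ReLU
    return [f @ W_dec[l] for l in range(n_layers)], f

resids = [rng.normal(size=(batch, d_model)) for _ in range(n_layers)]
recons, feats = crosscoder(resids)
l0 = (feats > 0).mean() * n_feats
print(f"mean active features: {l0:.0f} of {n_feats}")
```

Because the feature vector is shared, a feature active in the reconstruction of every layer is "persistent", while one whose decoder norm concentrates in a single layer is early- or late-only — the classification the notebook performs.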
Leaderboard
Rank your SAE on the public InterpScore leaderboard.
InterpScore v0.0.1 — rank your SAE
Composite metric · submit to the leaderboard
Compute the InterpScore of your SAE: loss_recovered + alive features + L0 sweet spot + sparse probing + TPP causal faithfulness. Emits interpscore.json, ready to PR into the public leaderboard at openinterp.org/interpscore.
⏱ ~20 min · Colab T4
▸ Colab Free · Gemma-2-2B default
Lenses
Classic tools — see what each layer is predicting.
Logit Lens — per-layer predictions
nostalgebraist 2020 · 5 lines of PyTorch
Apply final_ln + unembed to every intermediate residual stream. See what the model is "thinking" at each depth. Pure transformers — no TransformerLens dependency. Handles Llama/Gemma/Qwen/GPT-2/multimodal paths.
⏱ ~5 min · Colab T4
▸ Colab Free
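The trick really is about five lines; here with random stand-ins for the unembedding and the hooked residual streams (a real run pulls both from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers, seq = 16, 50, 4, 6

W_U = rng.normal(size=(d_model, vocab))              # unembedding (stand-in)
resids = [rng.normal(size=(seq, d_model)) for _ in range(n_layers)]  # hooked residuals

def final_ln(x, eps=1e-5):
    """LayerNorm with unit gain / zero bias, standing in for the model's final_ln."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens(resid):
    """The whole trick: final_ln + unembed applied to an intermediate layer."""
    return final_ln(resid) @ W_U

for layer, resid in enumerate(resids):
    top = int(logit_lens(resid)[-1].argmax())        # top prediction, last position
    print(f"layer {layer}: top token id {top}")
```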
Tuned Lens — calibrated predictions
Belrose 2023 · pretrained or fresh-fit
A per-layer affine translator that corrects the Logit Lens's biased, brittle predictions. Tries pretrained checkpoints first (GPT-2, Pythia, Llama-3-8B, OPT, Vicuna); falls back to a 200-step fresh fit on the Pile (~20 min on T4).
⏱ 2 min (pretrained) · 20 min (fresh fit)
▸ Colab Free
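A simplified analogue of the translator fit — least squares on residual streams rather than the KL-on-logits objective actually trained, on synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 500

# Stand-ins: intermediate residuals and the final-layer residuals a translator
# should map them to (a real run hooks these out of the model)
resid_mid = rng.normal(size=(n, d))
A_true = np.eye(d) + 0.1 * rng.normal(size=(d, d))
resid_final = resid_mid @ A_true + 0.05 * rng.normal(size=(n, d))

# Fit the per-layer affine translator by least squares
X = np.hstack([resid_mid, np.ones((n, 1))])            # append bias column
W, *_ = np.linalg.lstsq(X, resid_final, rcond=None)
A, b = W[:-1], W[-1]

err = np.linalg.norm(resid_mid @ A + b - resid_final) / np.linalg.norm(resid_final)
print(f"relative reconstruction error: {err:.3f}")
```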
Probing
The supervised baselines that SAE features must beat.
Linear Probe — the SAE baseline
Alain & Bengio 2016 · the indispensable baseline
Fit sklearn LogisticRegression on residual-stream activations. Per-layer AUROC sweep. Diff-of-means baseline shipped (Farquhar 2023 critique). This is the number any SAE feature-pack must beat.
⏱ ~10 min · Colab T4
▸ Colab Free
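The baseline in miniature, on synthetic activations with a planted class direction — stand-ins for real residual streams and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 400, 32
y = rng.integers(0, 2, size=n)
signal = rng.normal(size=d)                       # planted class direction
X = rng.normal(size=(n, d)) + np.outer(y - 0.5, signal)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])

# Diff-of-means baseline (Farquhar 2023): project onto mean(pos) - mean(neg)
dom = X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0)
dom_auroc = roc_auc_score(y_te, X_te @ dom)
print(f"probe AUROC {auroc:.2f} vs diff-of-means {dom_auroc:.2f}")
```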
CCS — Contrast Consistent Search
Burns 2022 · unsupervised truth-probing, with honest baselines
Replicates Burns et al. 2022 CCS on IMDB or TruthfulQA. Ships diff-of-means + supervised LR ceiling alongside CCS per Farquhar 2023 critique. Best-of-10 restarts. Honest verdict when CCS adds no value over diff-of-means.
⏱ ~15 min · Colab T4
▸ Colab Free
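The CCS objective itself, evaluated (not trained) on synthetic contrast pairs with a planted truth direction — the data construction is illustrative, and a real run optimizes this loss over w with restarts:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200
truth = rng.normal(size=d); truth /= np.linalg.norm(truth)

# Synthetic contrast pairs: "X is true" / "X is false" activations with a
# planted truth direction (stand-ins for real hidden states)
labels = rng.integers(0, 2, size=n)
base = rng.normal(size=(n, d))
x_pos = base + np.outer(2 * labels - 1, 3 * truth)
x_neg = base - np.outer(2 * labels - 1, 3 * truth)
x_pos -= x_pos.mean(0); x_neg -= x_neg.mean(0)      # per-set normalization

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def ccs_loss(w):
    """Burns 2022 objective: consistency (p+ should equal 1 - p-) + confidence."""
    p_pos, p_neg = sigmoid(x_pos @ w), sigmoid(x_neg @ w)
    consistency = np.mean((p_pos - (1 - p_neg)) ** 2)
    confidence = np.mean(np.minimum(p_pos, p_neg) ** 2)
    return consistency + confidence

rand_unit = rng.normal(size=d)
rand_unit /= np.linalg.norm(rand_unit)
print(f"loss along truth direction:  {ccs_loss(4 * truth):.3f}")
print(f"loss along random direction: {ccs_loss(4 * rand_unit):.3f}")
```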
RepE reading vector (LAT)
Zou 2023 · extract + monitor + steer a concept
Linear Artificial Tomography: 32 contrastive prompt pairs → PCA → the first component is the "honesty" / "sycophancy" / "refusal" / "confidence" direction. Monitor new prompts. Causality confirmed via ±α steering at the end.
⏱ ~10 min · Colab T4
▸ Colab Free
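The extraction recipe on synthetic data with a planted concept direction — pair counts, scales, and the random-order sign flip are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 32
concept = rng.normal(size=d); concept /= np.linalg.norm(concept)

# Synthetic activations for 32 contrastive pairs: one member of each pair
# carries an extra component along a hidden concept direction
neutral = rng.normal(size=(n_pairs, d))
pos = neutral + 2.0 * concept + 0.1 * rng.normal(size=(n_pairs, d))
neg = neutral + 0.1 * rng.normal(size=(n_pairs, d))

# LAT recipe: difference the pairs (random order, so the sign is random),
# then take the first principal component as the reading vector
signs = rng.choice([-1, 1], size=(n_pairs, 1))
diffs = signs * (pos - neg)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
reading_vec = vt[0]

align = abs(float(reading_vec @ concept))
print(f"|cosine| with planted direction: {align:.2f}")
```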
Safety + production
Q4 Watchtower preview.
Watchtower preview — monitor input prompts
Detect anomalous feature activations in production traffic
Q1 preview of the Q4 Watchtower Enterprise API. Streams input prompts, measures watchlist feature activations, flags anomalies above threshold, emits dashboard-style report. Forward-only, no generation.
⏱ ~5 min · any Colab
▸ Colab · any tier
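The flagging logic in miniature — watchlist indices, thresholds, and baseline statistics are hypothetical values for the sketch, not shipped defaults:

```python
# Watchlist: feature index -> z-score threshold (hypothetical values)
watchlist = {17: 4.0, 203: 3.0}

# Baseline statistics per watched feature, e.g. from calibration traffic
baseline = {f: (0.0, 1.0) for f in watchlist}   # (mean, std)

def flag_anomalies(feature_acts):
    """Flag watchlist features whose activation z-score exceeds its threshold."""
    report = []
    for feat, thresh in watchlist.items():
        mu, sigma = baseline[feat]
        z = (feature_acts.get(feat, 0.0) - mu) / sigma
        if z > thresh:
            report.append({"feature": feat, "z": round(z, 2)})
    return report

print(flag_anomalies({17: 0.5, 203: 5.2}))   # → [{'feature': 203, 'z': 5.2}]
```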