TRAIN · THE LADDER

Any model. Any scale.

Train your first sparse autoencoder in about 30 minutes on a free Colab T4. Train a hybrid-architecture SAE in 4–5 hours on free Kaggle. Train a paper-grade SAE on a cloud GPU. One ladder, zero gatekeeping.

beginner
TIER 1 · HOBBYIST

Your first SAE in 30 minutes

Platform: Google Colab · Free T4
Cost: $0
Model: Gemma-2-2B
Tokens: 50 M
Dictionary: 7× (n=16k)
Time: 30–40 min

Train a complete TopK SAE with AuxK dead-feature mitigation on Gemma-2-2B. Drive-based checkpoint recovery handles Colab's 90-minute idle disconnect. Ends with your own SAE uploaded to HuggingFace — citable, reusable, shareable.

What you'll learn
  • Forward hooks + residual stream extraction
  • TopK activation + AuxK auxiliary loss (Gao et al. 2024)
  • Geometric-median b_dec initialization
  • HuggingFace safetensors + cfg.json format
  • Crash-safe checkpointing to Google Drive
Prerequisites
  • Google account (Colab Free access)
  • HuggingFace account + HF_TOKEN in Colab Secrets
  • Edit one line: HF_USERNAME
intermediate
TIER 2 · EXPLORER

Hybrid-architecture SAE — Qwen3.5-4B

Platform: Kaggle · 2× T4 (32 GB)
Cost: $0 · 30 h/wk
Model: Qwen3.5-4B
Tokens: 150 M
Dictionary: 16× (n=40k)
Time: 4–5 h

The first public-ready SAE recipe for hybrid GDN (Gated Delta Network) architectures. It installs transformers from source for qwen3_5 support, uses the output_hidden_states path (Qwen3.5 has no .layers), and survives Kaggle kernel kills via HF-resumable checkpoints. It produces a publishable SAE that meets the Stage Gate 1 research bar.

What you'll learn
  • Hybrid GDN activation capture (output_hidden_states)
  • transformers-from-source install + restart dance
  • Dual-GPU model/SAE split (model on cuda:0, SAE on cuda:1)
  • HuggingFace streaming checkpoints for kernel-kill recovery
  • Held-out validation + val_report.json publishing
Prerequisites
  • Completed Tier 1, or SAE experience
  • Kaggle account + HF_TOKEN in Kaggle Secrets
  • Basic understanding of Gated Delta Networks (links in notebook)
advanced
TIER 3 · PAPER-GRADE

Paper-grade SAE — Qwen3.6-27B

Platform: Vast.ai / Lambda · RTX 6000 Pro (96 GB)
Cost: ~$30–60 / run
Model: Qwen3.6-27B
Tokens: 200 M
Dictionary: 13× (n=65k)
Time: 20–24 h

The Gemma-Scope-27B-parity recipe. 3 TopK SAEs trained in parallel on L11/L31/L55 with a single shared forward pass, 70/20/10 FineWeb-Edu + OpenThoughts + OpenMath corpus mix, and HF streaming checkpoints every 10M tokens so a crash costs at most 10 minutes. This is the notebook behind qwen36-27b-sae-papergrade.

What you'll learn
  • Multi-layer simultaneous SAE training (one forward pass, 3 SAEs)
  • Corpus mixing for reasoning-model SAEs
  • Streaming activation buffer pattern (never OOM)
  • AuxK calibration for large n (d_model/2 heuristic)
  • sae_lens / Neuronpedia-ready export
Prerequisites
  • Completed Tier 1 + Tier 2, or production SAE experience
  • Cloud GPU account (Vast.ai / Lambda / RunPod) with ≥96 GB VRAM
  • HF_TOKEN env var on the cloud instance
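
The 70/20/10 corpus mix in the Tier 3 recipe can be pictured as weighted sampling over three streams. This is a minimal sketch, assuming the mix is applied per document (the notebook may mix by tokens instead); `MIX`, the source keys, and `sample_source` are illustrative names, not the notebook's API.

```python
import random
from collections import Counter

# Mix weights from the recipe: 70% FineWeb-Edu, 20% OpenThoughts, 10% OpenMath.
MIX = {"fineweb_edu": 0.7, "openthoughts": 0.2, "openmath": 0.1}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next document is drawn from."""
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Draw 10,000 documents and check that the empirical shares track the weights.
rng = random.Random(0)
counts = Counter(sample_source(rng) for _ in range(10_000))
shares = {name: counts[name] / 10_000 for name in MIX}
print(shares)
```

Sampling the source per document keeps the three streams independent, so each can be resumed from its own offset after a crash.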
Beyond the ladder

39 more notebooks for every step of your SAE journey.

Post-train discovery, one-click steering, model coverage, research replication, safety monitoring — every notebook opens in Colab or Kaggle directly.

Closes the loop

You have an SAE. Now understand it, share it, edit it.
beginner

Discover your features

Auto-label your SAE with an LLM judge

You trained an SAE. Now what? This notebook streams activations, ranks features by interestingness, sends top-activating examples to Claude or GPT-4, and returns a feature_catalog.json with 1-sentence descriptions.

~20 min · Colab T4
Colab Free · ANTHROPIC_API_KEY or OPENAI_API_KEY
intermediate

Auto-interp at scale — paper-grade SAE

1500 features × 32 examples · Claude Opus 4.7 via OpenRouter

The auto-interp pipeline behind feature-catalog.json (1500 labels, ~$80 OpenRouter spend). Streams top-activating examples from the Qwen3.6-27B SAE (L11/L31/L55), filters Pile-noise features, sends to Opus 4.7, and emits per-feature semantic labels. Apache-2.0.

~6 h · Colab T4 (~$0 GPU + LLM credit)
Colab Free · OPENROUTER_API_KEY required
beginner

Auto-interp targeted — circuit features

Label only the features your circuit needs

Smaller-budget variant of 04b — pass a list of (layer, feature_id) tuples (e.g., from Sparse Feature Circuits output) and get labels only for those. Used to fill in 36 missing labels for the medical / IOI / math / refusal circuits in the Circuit Canvas viewer.

~10 min · Colab T4 (~$1 LLM credit)
Colab Free · OPENROUTER_API_KEY required
beginner

Build a shareable Trace

Your SAE + your prompt → trace.json + shareable URL

Generate a TraceData JSON (exact Trace Theater schema) for a custom prompt + SAE. Emits the same format /observatory/trace consumes. Upload to HF and share the URL.

~5 min · Colab T4
Colab Free
intermediate

Steer your model

Live feature intervention — baseline vs α ∈ [-3, 0, 1, 3]

Pick a feature, slide its activation coefficient, regenerate. Shows causal effect side-by-side. Q1 preview of the Q2 Sandbox. Exports interventions.json for inclusion in Trace Theater counterfactuals.

~3 min · Colab T4
Colab Free
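
One common way to realize the α sweep is to add α times the feature's decoder direction to the residual-stream vector before regenerating. The sketch below assumes that additive form and uses plain lists as toy vectors; `steer` is a hypothetical helper, and the notebook's actual intervention may differ.

```python
def steer(resid, direction, alpha):
    """Return resid + alpha * direction, element-wise on plain lists."""
    if len(resid) != len(direction):
        raise ValueError("resid and direction must have the same length")
    return [r + alpha * d for r, d in zip(resid, direction)]

resid = [0.5, -1.0, 2.0]        # toy residual vector (d_model = 3)
direction = [1.0, 0.0, -1.0]    # toy decoder direction for the chosen feature

# The sweep listed above; alpha = 0 reproduces the baseline.
for alpha in (-3, 0, 1, 3):
    print(alpha, steer(resid, direction, alpha))
```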

Reduce friction

Pick the right tier before you spin up a GPU.
beginner

Pick your tier

VRAM calculator + layer recommender

Interactive "what tier should I train?" advisor. Auto-detects your GPU, asks for time budget, recommends a notebook + model + layer. Zero GPU required.

< 1 min · CPU fine
Anywhere
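
The core of a tier advisor can be as small as a VRAM lookup. This sketch uses the VRAM figures from the side-by-side table on this page and ignores the time-budget question; `TIERS` and `recommend_tier` are illustrative names, not the notebook's code.

```python
# (tier, VRAM needed in GB, model), largest first. Figures from the side-by-side table.
TIERS = [
    ("Paper-grade", 96, "Qwen3.6-27B"),
    ("Explorer", 32, "Qwen3.5-4B"),
    ("Hobbyist", 15, "Gemma-2-2B"),
]

def recommend_tier(vram_gb: float):
    """Return (tier, model) for the largest tier that fits, or None if nothing fits."""
    for name, needed_gb, model in TIERS:
        if vram_gb >= needed_gb:
            return name, model
    return None

print(recommend_tier(15))   # a free Colab T4
print(recommend_tier(40))   # a single 40 GB card
print(recommend_tier(8))    # too small for any tier
```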

More models

Same recipe, different architectures.
intermediate

Llama-3.1-8B SAE

Tier 2 port — Llama-3.1-8B on Kaggle free

Train an SAE on one of the most popular open models. 100M tokens on Kaggle 2× T4 in ~5–6 h, with HF resumable checkpoints and the standard .model.layers path.

5–6 h · Kaggle 2× T4
Kaggle Free · Meta license acceptance required
intermediate

Mistral-7B SAE

Tier 2 port — Mistral-7B-v0.3 on Kaggle free

Clean decoder, sliding-window attention is transparent to SAE training. Same Kaggle recipe as Llama, swaps the model. HF resumable checkpoints.

4–5 h · Kaggle 2× T4
Kaggle Free
beginner

Phi-3-mini SAE

Tier 1 alt — even faster hobbyist path

Microsoft Phi-3-mini (3.8B) fits comfortably on a free Colab T4. 20-minute training, Drive checkpoints, and a first feature-discovery pass included.

~20 min · Colab T4
Colab Free

Research-grade

Replicate published results. Write your paper.
intermediate

Stage Gate G1 — correlation pre-test

ρ ≥ 0.30 or don't burn GPU on RL

Replicates the Stage Gate 1 protocol from mechreward. Computes Spearman ρ between your SAE feature pack and GSM8K correctness on 100 held-out samples. Pass/fail + scatter plot + report upload.

20–30 min · Colab T4
Colab · any tier
advanced

BatchTopK vs TopK

Replicate arxiv:2412.06410

Train TopK and BatchTopK on identical activation batches, compare Pareto (var_exp, L0, dead%). Shows where BatchTopK dominates and by how much.

~45 min · Colab T4
Colab Free
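
The difference between the two activations fits in a few lines: TopK keeps k latents per sample, while BatchTopK keeps k × batch_size latents across the whole batch, so individual samples can use more or fewer. A toy sketch on plain lists, assuming non-negative (post-ReLU) activations; the helper names are illustrative.

```python
import heapq

def topk(batch, k):
    """Per-sample TopK: every row keeps exactly its k largest activations."""
    out = []
    for row in batch:
        keep = set(heapq.nlargest(k, range(len(row)), key=row.__getitem__))
        out.append([v if i in keep else 0.0 for i, v in enumerate(row)])
    return out

def batch_topk(batch, k):
    """BatchTopK: keep the k * batch_size largest activations across the whole batch."""
    flat = [(v, r, c) for r, row in enumerate(batch) for c, v in enumerate(row)]
    keep = {(r, c) for _, r, c in heapq.nlargest(k * len(batch), flat)}
    return [[v if (r, c) in keep else 0.0 for c, v in enumerate(row)]
            for r, row in enumerate(batch)]

def l0_per_row(acts):
    """Number of active (non-zero) latents in each row."""
    return [sum(1 for v in row if v != 0.0) for row in acts]

batch = [[0.9, 0.8, 0.7, 0.1],     # a "dense" sample
         [0.3, 0.2, 0.05, 0.0]]    # a "sparse" sample

print(l0_per_row(topk(batch, 2)))        # same budget for every sample
print(l0_per_row(batch_topk(batch, 2)))  # budget shifts toward the dense sample
```

Both variants have the same average L0 on this batch; only the per-sample allocation changes.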

Circuits

Attribution graphs between SAE features. View with /observatory/circuits.
intermediate

Attribution Patching (AtP*)

Kramár 2024 — QK-fix + GradDrop · node attribution

Compute per-feature attribution scores on your SAE using AtP* (the 2-forward-1-backward linearization). Mean-ablation baseline, QK-fix for attention heads, GradDrop for sign-cancellation robustness. Emits circuit JSON for the Canvas viewer.

~15 min · Colab T4
Colab Free
advanced

Sparse Feature Circuits (Marks 2024)

arxiv:2403.19647 replication · node + edge DAG

Full replication of Marks et al. 2024. Node attribution via AtP + IG-10 fallback for early layers. Edge attribution via Appendix A.1 (upstream decoder × downstream encoder × upstream delta × downstream gradient). SAE error terms as triangle nodes.

~20 min · A100
Colab Pro A100
advanced

ACDC slow-mode via AutoCircuit

NeurIPS 2023 algorithm · independent verification

Run the original ACDC algorithm (Conmy 2023) using AutoCircuit (UFO-101 — the practitioner-default fork). Slower than AtP but peer-reviewed. Compare faithfulness curves across methods. Emits circuit.json compatible with the Canvas viewer.

1–2 h · Colab T4
Colab · any tier
advanced

Sparse Feature Circuits — paper-grade 27B

Marks 2024 method on Qwen3.6-27B / 65k features per layer

Scaled-up companion to notebook 15. Same SFC pipeline (node attribution via AtP, edge attribution via Marks Appendix A.1) but on the published Qwen3.6-27B SAE across L11/L31/L55. Emits the circuit JSON files consumed by /observatory/circuits (medical, IOI, math, refusal scenarios).

~1 h · A100 / RTX 6000 Pro
Cloud GPU · ≥48 GB VRAM
advanced

Train a Sparse Crosscoder

Lindsey 2024 · shared dictionary across 3+ layers

Train a single crosscoder that reads and writes across multiple residual layers simultaneously. Unifies L11/L31/L55-style multi-layer SAEs into one feature index. Greenfield — not yet in SAELens. Classifies features as persistent / early-only / late-only / mixed.

~30 min · T4 (20M tok) · scales to paper-grade
Colab Free · T4
advanced

Cross-model crosscoder + Pearson_CE

Gemma-2-2B base/IT · BatchTopK · cosine vs causal universality

Companion to 17 (cross-layer). This one is cross-model: it diffs the base vs IT-tuned variants of Gemma-2-2B with BatchTopK + decoder-norm sparsity. First per-feature Pearson causal-equivalence test in the literature. Median cosine 0.965 vs median CE 0.616 on shared features (38% gap). This is the methodology that paper-1 uses for the ICML MI Workshop 2026.

~5 h · A100 / RTX 6000 Pro
Cloud GPU · ≥40 GB VRAM
advanced

RL-diffing crosscoder — base vs mechreward

Qwen3.5-4B base vs G3-LoRA · LoRA toggle pattern

Cross-stage crosscoder companion to 17b. Single base model + LoRA toggle via PEFT.disable_adapter() for activation collection. Diffs Qwen3.5-4B base against the mechreward-G3 LoRA (GSM8K 64%→83%) — first cross-stage RL diffing crosscoder on hybrid GDN architecture. Hypothesis: RL preserves residuals at L18 but rewires downstream consumers — Pearson_CE catches it.

~5 h · A100 / RTX 6000 Pro
Cloud GPU · ≥40 GB VRAM

Leaderboard

Rank your SAE on the public InterpScore leaderboard.
intermediate

InterpScore v0.0.1 — rank your SAE

Composite metric · submit to the leaderboard

Compute the InterpScore of your SAE: loss_recovered + alive features + L0 sweet spot + sparse probing + TPP causal faithfulness. Emits interpscore.json, ready to PR into the public leaderboard at openinterp.org/interpscore.

~20 min · Colab T4
Colab Free · Gemma-2-2B default
advanced

InterpScore on the paper-grade 27B SAE

Real numbers — L11=0.7788 / L31=0.7600 / L55=0.7507

Computes InterpScore v0.0.1 on the public caiovicentino1/qwen36-27b-sae-papergrade SAE. Loss-recovered + alive features + L0 sweet spot + sparse probing + TPP causal faithfulness with proportional k=0.5% of d_sae. Emits interpscore.json identical to what we submitted to the public leaderboard.

~3 h · A100 / RTX 6000 Pro
Cloud GPU · ≥48 GB VRAM

Lenses

Classic tools — see what each layer is predicting.
beginner

Logit Lens — per-layer predictions

nostalgebraist 2020 · 5 lines of PyTorch

Apply final_ln + unembed to every intermediate residual stream. See what the model is "thinking" at each depth. Pure transformers — no TransformerLens dep. Handles Llama/Gemma/Qwen/GPT-2/multimodal paths.

~5 min · Colab T4
Colab Free
intermediate

Tuned Lens — calibrated predictions

Belrose 2023 · pretrained or fresh-fit

Per-layer affine transformation that fixes Logit Lens under-specification. Tries pretrained checkpoints first (GPT-2, Pythia, Llama-3-8B, OPT, Vicuna); falls back to 200-step fresh training on the Pile (~20 min on T4).

2 min (pretrained) · 20 min (fresh fit)
Colab Free

Probing

The supervised baselines that SAE features must beat.
beginner

Linear Probe — the SAE baseline

Alain & Bengio 2016 · the indispensable baseline

Fit sklearn LogisticRegression on residual-stream activations. Per-layer AUROC sweep. Diff-of-means baseline shipped (Farquhar 2023 critique). This is the number any SAE feature-pack must beat.

~10 min · Colab T4
Colab Free
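
The diff-of-means baseline is simple enough to write without sklearn: take the difference between the two class means as the probe direction, project held-out activations onto it, and score with AUROC. A toy sketch with 2-d vectors standing in for residual-stream activations; all names and data are illustrative.

```python
def diff_of_means_direction(pos, neg):
    """Direction = mean(positive activations) - mean(negative activations)."""
    dim = len(pos[0])
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def project(vec, direction):
    """Dot product of an activation vector with the probe direction."""
    return sum(a * b for a, b in zip(vec, direction))

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Toy 2-d "activations"; real ones would be residual-stream vectors at one layer.
train_pos = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
train_neg = [[-1.0, 0.1], [-0.7, 0.3], [-1.1, -0.2]]
test_pos = [[0.9, 0.5], [0.1, 0.0]]
test_neg = [[-0.8, 0.4], [0.3, -0.3]]

w = diff_of_means_direction(train_pos, train_neg)
score = auroc([project(v, w) for v in test_pos],
              [project(v, w) for v in test_neg])
print(score)
```

Repeating this per layer gives the AUROC sweep; the pairwise AUROC loop is quadratic, which is fine for a sketch but slow on large evaluation sets.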
intermediate

CCS — Contrast Consistent Search

Burns 2022 · unsupervised truth-probing, with honest baselines

Replicates Burns et al. 2022 CCS on IMDB or TruthfulQA. Ships diff-of-means + supervised LR ceiling alongside CCS per Farquhar 2023 critique. Best-of-10 restarts. Honest verdict when CCS adds no value over diff-of-means.

~15 min · Colab T4
Colab Free
intermediate

RepE reading vector (LAT)

Zou 2023 · extract + monitor + steer a concept

Linear Artificial Tomography. 32 contrastive prompt pairs → PCA → first component is the "honesty" / "sycophancy" / "refusal" / "confidence" direction. Monitor new prompts. Confirmed causal via ±α steering at the end.

~10 min · Colab T4
Colab Free

Hallucination — detection & steering

The full research arc behind the 2026-04-25 blog post: Ferrando replication on 27B, single-feature null, multi-feature ablation under random-K + Claude-judge controls. Plus the ICML MI Workshop paper-1 baseline notebook.
intermediate

Entity-recognition v0.0.1 — the failed first try

How a 2× tokenization confound gave a fake AUROC=1.0

Educational "how NOT to do it" notebook. Synthetic Slavic-style fake-entity names had ~2× the token count of famous entities — even the best feature was just counting subword tokens. Posted unchanged so the failure mode is reproducible. The fix is in 24b.

~30 min · Colab T4
Colab Free
advanced

Ferrando 2024 replication on Qwen3.6-27B

AUROC 0.8379 on real Wikidata entities (vs 0.732 baseline)

The methodology fix for 24. Uses real known/unknown Wikidata entities from javiferran/sae_entities, labels via attribute recall on the 27B model, applies the Pile noise filter (>2% rate dropped), and ranks single latents by Cohen's d. Surfaces feature L11/f61723 — first proper Ferrando replication at 27B scale.

~2 h · Colab A100
Colab Pro · A100 recommended
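
Ranking latents by Cohen's d can be sketched with the standard library. This assumes the pooled-variance form of d and ranks by absolute value, so features that fire on unknown entities rank alongside those that fire on known ones; the feature IDs and activations are toy data.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Toy activations: feature_id -> (values on known entities, values on unknown entities).
features = {
    101: ([2.0, 2.2, 1.8, 2.1], [0.1, 0.0, 0.2, 0.1]),
    102: ([0.5, 0.4, 0.6, 0.5], [0.5, 0.6, 0.4, 0.5]),
    103: ([0.0, 0.1, 0.0, 0.1], [1.0, 1.2, 0.9, 1.1]),
}

# Largest |d| first; the sign of d tells you which group the feature prefers.
ranked = sorted(features, key=lambda f: abs(cohens_d(*features[f])), reverse=True)
print(ranked)
```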
advanced

Single-feature steering — the null result

Clamp ±5 on f61723 · no calibration effect

First steering test: clamp the entity-recognition feature to ±5 (additive ±2) at L11 and check whether refusal rate on unknown entities moves. It does not. Detection ≠ control. Sets up the multi-feature experiments in 26 / 27.

~45 min · Colab A100
Colab Pro · A100 recommended
advanced

Multi-feature steering — top-K (no controls)

−15pp on unknown refusal · would have shipped overclaimed

Ablate top-K (K∈{5,20,50,200}) features sorted by Cohen's d. Naive read: −15pp on unknown-entity refusal at K=200 — looks like a calibration knob. We almost shipped it before adding controls. The honest version is 27.

~1.5 h · Colab A100
Colab Pro · A100 recommended
advanced

Multi-feature steering with full controls

Random-K null + direction-sort + Claude judge → it induces hallucination

The walk-back. Six controls: random-K (R=30 draws), direction-sorted (top positive-d / top negative-d / mixed |d|), 3-way split, anti-feature, Claude Haiku judge, permutation test. Top-K is 4-8σ outside random null — but the judge shows the "less hedging" is confident-wrong answers (62%→77% on incorrect refusal), not improved correctness. Hallucination-induction mechanism, not a calibration knob.

~3 h · Colab A100
Colab Pro · A100 + ANTHROPIC_API_KEY
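
The random-K control reduces to a z-score: how many null standard deviations does the top-K effect sit from ablations of K randomly chosen features? The sketch below uses R=30 draws as in the notebook description, but `toy_effect` is a stand-in for the real "ablate and measure refusal rate" step, and every name here is hypothetical.

```python
import random
from statistics import mean, stdev

def random_k_zscore(effect_fn, top_k_ids, all_ids, n_draws=30, seed=0):
    """Distance of the top-K effect from the random-K null, in null standard deviations."""
    rng = random.Random(seed)
    k = len(top_k_ids)
    null = [effect_fn(rng.sample(all_ids, k)) for _ in range(n_draws)]
    return (effect_fn(top_k_ids) - mean(null)) / stdev(null)

# Toy stand-in: each feature has a fixed contribution and the effect is their sum.
toy_rng = random.Random(1)
contribution = {f: toy_rng.gauss(0.0, 0.01) for f in range(1000)}
top_k = sorted(contribution, key=contribution.get)[:20]   # the 20 most negative

def toy_effect(feature_ids):
    """Pretend ablation effect: sum of the per-feature contributions."""
    return sum(contribution[f] for f in feature_ids)

z = random_k_zscore(toy_effect, top_k, list(contribution))
print(round(z, 1))
```

A large |z| only shows the effect is not random; as the notebook stresses, it takes the judge to tell whether the change is an improvement.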
advanced

Paper baselines — Ferrando 2024 on Qwen3.6-27B

L31/f34957 0.81 · LR 0.887 · diff-of-means 0.859

The headline-numbers notebook for the ICML MI Workshop paper-1. Ferrando-style entity-recognition replication with 607 entities, per-layer scan across all 64 layers for linear probe + diff-of-means baselines, 95% bootstrap CI, HF resumable checkpoints. Cleanly compares single SAE feature vs supervised LR ceiling vs cross-bench generalization.

~3 h · Colab A100
Colab Pro · A100 recommended
advanced

Sensitivity — refusal-only vs Ferrando labelling

Same residual capture · 2 labelling rules → which signal survives?

Ablation companion to 28. Re-uses the cached residual capture, swaps labelling rule (refusal-only vs Ferrando-style confabulation-as-unknown). Builds reviewer defence: shows the L31/f34957 0.81 AUROC is robust to the labelling rule choice, falsifies an earlier "L11 best" claim from v0.0.2.

~30 min · Colab T4
Colab Free

Guards — product reproducers

Reproduce the exact numbers behind shipped openinterp Guards. Each notebook is a self-contained validation of a probe that ships in the PyPI SDK.
advanced

FabricationGuard PoC v1 — single SAE feature

How the entity feature failed cross-bench

The first attempt at HallucinationGuard: single SAE feature L31/f34957 (AUROC 0.81 on Ferrando entity test) applied to 4 public benchmarks. AUROC collapsed to chance (~0.5) on TruthfulQA / HaluEval / SimpleQA / MMLU. This is the honest negative result that motivated v2. Open-sourced as part of the OpenInterp methodology arc.

~1.5 h · Colab Pro+ RTX 6000
Colab Pro+ · ≥48 GB VRAM
advanced

FabricationGuard v2 — linear probe (production)

AUROC 0.88 cross-task SimpleQA · −88% confident-wrong reduction

The probe behind the shipped openinterp.FabricationGuard (PyPI v0.2.0). Multi-feature LR on residual stream at L31, trained on TruthfulQA + HaluEval + MMLU train splits, evaluated cross-task on held-out SimpleQA. AUROC 0.88. Mitigation analysis shows a 52–88% reduction in confident-wrong answers on factual QA. Outputs the probe.joblib that ships in the SDK.

~50 min · Colab Pro+ RTX 6000
Colab Pro+ · ≥48 GB VRAM
advanced

ReasonGuard v0.1 — probe during thinking

AUROC 0.888 within math · 0.605 cross-domain · honest narrow scope

Companion to 31. Score the *reasoning trace itself* (during thinking-mode generation), not the prompt. Sweeps 3 layers × 4 positions. Best: L55/mid_think AUROC 0.888 within GSM8K. Cross-bench AUROC 0.605 on StrategyQA — domain-bound, registered honestly. First negative-result-shipped-honestly entry on ProbeBench.

~6 h · RTX 6000 (rollouts + training, resumable)
Colab Pro+ · ≥48 GB VRAM
intermediate

ReasonGuard v0.2 — multi-bench combined training

FabricationGuard methodology applied to reasoning · cross-bench transfer test

Trains the production ReasonGuard probe with combined-bench methodology (GSM8K + StrategyQA + MATH together). v0.1 failed cross-bench (single-bench training). v0.2 tests whether multi-bench training closes the transfer gap, matching how FabricationGuard achieved AUROC 0.88 cross-task. Takes rollouts.npz from notebook 32 as input; outputs probe v0.2 with per-bench held-out AUROC and a comparison to v0.1.

~5 min · Colab Free (CPU-only)
Any · sklearn only
advanced

Multi-Probe DPO POC — Qwen3.6-27B + FG + RG

First OSS multi-probe-reward DPO on 27B · Goodfire RLFR pattern + orthogonal probes

Proof-of-concept: fine-tune Qwen3.6-27B with DPO using a combined multi-probe reward built from FabricationGuard (L31/end_question) + ReasonGuard (L55/mid_think). Probes run on FROZEN base (LoRA disabled) — gradient never flows through them. Anti-Goodhart by orthogonal-objective design: student must satisfy both probes simultaneously, drastically harder to game than single-probe RL. Direct extension of Goodfire RLFR (-58% halu) to multi-probe.

~2 h · Colab Pro+ RTX 6000
Colab Pro+ · ≥80 GB VRAM
advanced

Anti-Goodhart fresh-probe validation

4-quadrant verdict · distinguishes real improvement from probe evasion

Companion to 35. Trains a FRESH probe on student-generated samples (post-DPO). Compares 3 numbers: (1) halu rate change, (2) original probe AUROC on student, (3) fresh probe AUROC on student. 4-quadrant matrix tells you whether DPO actually reduced hallucination or just learned to evade the original probe direction. The honest validation any probe-reward training pipeline needs.

~30 min · Colab Pro+ RTX 6000
Colab Pro+ · ≥80 GB VRAM

Safety + production

Q4 Watchtower preview.
intermediate

Watchtower preview — monitor input prompts

Detect anomalous feature activations in production traffic

Q1 preview of the Q4 Watchtower Enterprise API. Streams input prompts, measures watchlist feature activations, flags anomalies above threshold, emits dashboard-style report. Forward-only, no generation.

~5 min · any Colab
Colab · any tier
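
The flagging step can be pictured as a filter over per-prompt feature activations. This is a rough sketch assuming a single fixed threshold and a dict of activations per prompt; `flag_anomalies`, the feature IDs, and the values are all hypothetical, not the Watchtower API.

```python
def flag_anomalies(activations, watchlist, threshold):
    """Return (feature_id, activation) for watchlist features above threshold, highest first."""
    hits = [(fid, activations.get(fid, 0.0)) for fid in watchlist]
    hits = [(fid, act) for fid, act in hits if act > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical per-prompt SAE activations: feature_id -> max activation over tokens.
prompt_acts = {7: 0.2, 11: 4.1, 42: 7.5, 900: 12.0}
watchlist = [11, 42, 5]   # feature 900 is not watched; feature 5 never fired

report = flag_anomalies(prompt_acts, watchlist, threshold=3.0)
print(report)
```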
Side-by-side

Pick the tier that matches your compute.

Tier: Hobbyist | Explorer | Paper-grade
Platform: Colab Free T4 | Kaggle 2× T4 | Cloud RTX 6000 Pro
Cost: $0 | $0 · 30 h/wk quota | ~$30–60 per run
VRAM: 15 GB | 32 GB | 96 GB
Model: Gemma-2-2B (2.6 B) | Qwen3.5-4B (4.0 B) | Qwen3.6-27B (27 B)
Architecture: Dense | Hybrid GDN | Dense (reasoning)
Dictionary: n=16k (7×) | n=40k (16×) | n=65k (13×)
TopK: k=64 | k=128 | k=128 + AuxK
Tokens: 50 M | 150 M | 200 M
Time: 30–40 min | 4–5 h | 20–24 h
What you get: First SAE | Hybrid-arch SAE | Paper-grade SAE
The shared recipe

Every tier uses the same research-grade recipe.

The protocol scales. Only the hyperparameters change.

01

Stream activations

Hook the model's residual stream at a mid-layer. Stream a web-scale corpus (FineWeb-Edu, OpenThoughts, custom). Pack into batches, emit activation vectors.

02

TopK SAE + AuxK

Hard TopK activation (Gao et al. 2024). AuxK auxiliary loss revives dead features. Geometric-median b_dec init. Decoder columns re-normed every step.
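
Two of these pieces are easy to show on plain lists: the hard TopK activation and the per-step decoder renormalization. This is a toy sketch, not the notebooks' PyTorch code; it leaves out the AuxK loss and the geometric-median init, and it stores the decoder as one direction vector per feature (whether those are rows or columns depends on the tensor layout).

```python
import heapq
import math

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (after ReLU) and zero everything else."""
    keep = heapq.nlargest(k, range(len(pre_acts)), key=pre_acts.__getitem__)
    latents = [0.0] * len(pre_acts)
    for i in keep:
        latents[i] = max(pre_acts[i], 0.0)
    return latents

def renorm_decoder(w_dec):
    """Rescale each feature's decoder direction to unit L2 norm."""
    out = []
    for direction in w_dec:
        norm = math.sqrt(sum(v * v for v in direction)) or 1.0  # leave all-zero rows alone
        out.append([v / norm for v in direction])
    return out

pre_acts = [0.3, -1.2, 2.5, 0.9, 1.7]
print(topk_activation(pre_acts, k=2))
print(renorm_decoder([[3.0, 4.0], [0.0, 2.0]]))
```

Renormalizing after every optimizer step keeps the SAE from shrinking latent values while growing decoder norms to compensate.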

03

Resume-safe checkpoint

HuggingFace streaming checkpoints every 5–10M tokens. If Colab idles, Kaggle kills the kernel, or the cloud instance crashes — you lose at most 10 minutes.
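
The resume logic behind this step is independent of where the checkpoint lives. The sketch below saves to a local file with an atomic rename and resumes from the last saved token count; uploading to HuggingFace or Drive is left out, and `save_checkpoint`, `load_checkpoint`, `train`, and the step size are illustrative assumptions.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then atomically swap it in, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"tokens_seen": 0}

CHECKPOINT_EVERY = 10_000_000   # tokens; the recipe uses 5–10M
TOKENS_PER_STEP = 2_500_000     # toy value so the demo finishes instantly

def train(path, total_tokens):
    """Resume from the last checkpoint and train until total_tokens have been seen."""
    state = load_checkpoint(path)
    while state["tokens_seen"] < total_tokens:
        state["tokens_seen"] += TOKENS_PER_STEP   # stand-in for one optimizer step
        if state["tokens_seen"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "state.json")
    train(ckpt, 20_000_000)        # first session
    print(load_checkpoint(ckpt))   # what a restarted session would resume from
```

A real checkpoint would also carry the SAE weights, optimizer state, and data-stream offset; the token counter alone is enough to show the resume pattern.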

04

Cosine LR + warmup

1k–5k warmup to peak 2e-4, cosine decay to 6e-5 floor. Non-zero floor keeps dead-feature revival active throughout training.
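
The schedule can be written as a single function of the step. This sketch takes the peak (2e-4) and floor (6e-5) from the recipe and assumes a linear warmup measured in optimizer steps; `lr_at` and `total_steps` are illustrative names.

```python
import math

def lr_at(step, total_steps, warmup_steps=1_000, peak=2e-4, floor=6e-5):
    """Linear warmup to `peak`, then cosine decay down to a non-zero `floor`."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)   # hold at the floor if training runs long
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

total = 10_000
for step in (0, 500, 999, 5_500, 10_000):
    print(step, f"{lr_at(step, total):.2e}")
```

Most frameworks let you plug a function like this into their scheduler as a multiplier on the base learning rate.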

05

Held-out validation

Final step: compute var_expl, L0, and dead-fraction on 500k–1M fresh tokens (different seed). Upload val_report.json — your SAE ships with receipts.
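
The three numbers in the report have short definitions. The sketch below assumes var_expl is one minus residual variance over total variance around the per-dimension mean, L0 is the mean count of active latents per token, and a feature is dead if it never fires on the validation set; the notebooks may define these slightly differently, and `validation_report` is a hypothetical helper.

```python
import json

def validation_report(x, x_hat, latents):
    """x, x_hat: lists of activation vectors; latents: list of SAE latent vectors."""
    n, dim = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(dim)]
    resid = sum((a - b) ** 2 for row, rec in zip(x, x_hat) for a, b in zip(row, rec))
    total = sum((a - m) ** 2 for row in x for a, m in zip(row, mean))
    fired = [any(row[j] != 0.0 for row in latents) for j in range(len(latents[0]))]
    return {
        "var_expl": 1.0 - resid / total,
        "l0": sum(sum(1 for v in row if v != 0.0) for row in latents) / n,
        "dead_fraction": fired.count(False) / len(fired),
    }

# Toy held-out batch: 4 tokens, d_model = 2, 4 SAE features.
x = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [0.0, 0.0]]
x_hat = [[2.0, 0.0], [0.0, 2.0], [2.0, 1.0], [0.0, 1.0]]
latents = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

report = validation_report(x, x_hat, latents)
print(json.dumps(report, indent=2))   # the kind of payload val_report.json would hold
```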

06

SAELens-compatible export

safetensors with W_enc, W_dec, b_enc, b_dec + cfg.json with architecture, d_in, d_sae, k, hook_name. Load directly in SAELens. Ready for Neuronpedia ingestion.
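
A cfg.json for the Tier 1 SAE might look like the dict below. Several values are assumptions: `d_sae` is taken as 16384 for "n=16k", the `architecture` string and `hook_name` are hypothetical, and the tensor shapes follow the (d_in, d_sae) encoder / (d_sae, d_in) decoder layout, which should be checked against the SAELens version you load with.

```python
import json

cfg = {
    "architecture": "topk",                     # hypothetical label
    "d_in": 2304,                               # Gemma-2-2B residual width
    "d_sae": 16384,                             # assumed exact size behind "n=16k"
    "k": 64,                                    # Tier 1 TopK from the comparison table
    "hook_name": "blocks.12.hook_resid_post",   # hypothetical hook point
}

def expected_shapes(cfg):
    """Tensor shapes implied by cfg, under the assumed weight layout."""
    return {
        "W_enc": (cfg["d_in"], cfg["d_sae"]),
        "W_dec": (cfg["d_sae"], cfg["d_in"]),
        "b_enc": (cfg["d_sae"],),
        "b_dec": (cfg["d_in"],),
    }

def shape_mismatches(tensor_shapes, cfg):
    """Names of tensors whose shape disagrees with cfg (empty list means consistent)."""
    want = expected_shapes(cfg)
    return [name for name in want if tuple(tensor_shapes.get(name, ())) != want[name]]

# Shapes as they would be read back from the saved safetensors file.
shapes = {"W_enc": (2304, 16384), "W_dec": (16384, 2304),
          "b_enc": (16384,), "b_dec": (2304,)}

print(json.dumps(cfg, indent=2))
print(shape_mismatches(shapes, cfg))
```

Checking shapes against cfg before uploading catches the most common export mistake, a transposed encoder or decoder.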

Stuck? Lost? Want your notebook added?

Open an issue on GitHub, email us, or propose your own tier-specific recipe. We review every PR — unusual architectures (Mamba, RWKV, diffusion-LM) especially welcome.