TRAIN · THE LADDER

Any model. Any scale.

Train your first sparse autoencoder in 30 minutes on a free Colab T4. Train a hybrid-architecture SAE in 4 hours on Kaggle's free tier. Train a paper-grade SAE in the cloud. One ladder, zero gatekeeping.

beginner
TIER 1 · HOBBYIST

Your first SAE in 30 minutes

Platform: Google Colab · Free T4
Cost: $0
Model: Gemma-2-2B
Tokens: 50 M
Dictionary: 7× (n=16k)
Time: 30–40 min

Train a complete TopK SAE with AuxK dead-feature mitigation on Gemma-2-2B. Drive-based checkpoint recovery handles Colab's 90-minute idle disconnect. Ends with your own SAE uploaded to HuggingFace — citable, reusable, shareable.

What you'll learn
  • Forward hooks + residual stream extraction
  • TopK activation + AuxK auxiliary loss (Gao et al. 2024)
  • Geometric-median b_dec initialization
  • HuggingFace safetensors + cfg.json format
  • Crash-safe checkpointing to Google Drive
Prerequisites
  • Google account (Colab Free access)
  • HuggingFace account + HF_TOKEN in Colab Secrets
  • Edit one line: HF_USERNAME
intermediate
TIER 2 · EXPLORER

Hybrid-architecture SAE — Qwen3.5-4B

Platform: Kaggle · 2× T4 (32 GB)
Cost: $0 · 30 h/wk
Model: Qwen3.5-4B
Tokens: 150 M
Dictionary: 16× (n=40k)
Time: 4–5 h

The first public-release-ready SAE recipe for hybrid GDN architectures. It installs transformers from source for qwen3_5 support, uses the output_hidden_states path (Qwen3.5 exposes no .layers attribute), and survives Kaggle kernel kills via HF-resumable checkpoints. Produces a publishable SAE that meets the Stage Gate 1 research bar.

What you'll learn
  • Hybrid GDN activation capture (output_hidden_states)
  • transformers-from-source install + restart dance
  • Dual-GPU model/SAE split (model on cuda:0, SAE on cuda:1)
  • HuggingFace streaming checkpoints for kernel-kill recovery
  • Held-out validation + val_report.json publishing
Prerequisites
  • Completed Tier 1, or SAE experience
  • Kaggle account + HF_TOKEN in Kaggle Secrets
  • Basic understanding of Gated Delta Networks (links in notebook)
advanced
TIER 3 · PAPER-GRADE

Paper-grade SAE — Qwen3.6-27B

Platform: Vast.ai / Lambda · RTX 6000 Pro (96 GB)
Cost: ~$30–60 / run
Model: Qwen3.6-27B
Tokens: 200 M
Dictionary: 13× (n=65k)
Time: 20–24 h

The Gemma-Scope-27B-parity recipe. Three TopK SAEs trained in parallel on L11/L31/L55 from a single shared forward pass, a 70/20/10 FineWeb-Edu + OpenThoughts + OpenMath corpus mix, and HF streaming checkpoints every 10 M tokens so a crash costs at most 10 minutes. This is the notebook behind qwen36-27b-sae-papergrade.

What you'll learn
  • Multi-layer simultaneous SAE training (one forward pass, 3 SAEs)
  • Corpus mixing for reasoning-model SAEs
  • Streaming activation buffer pattern (never OOM)
  • AuxK calibration for large n (d_model/2 heuristic)
  • sae_lens / Neuronpedia-ready export
Prerequisites
  • Completed Tier 1 + Tier 2, or production SAE experience
  • Cloud GPU account (Vast.ai / Lambda / RunPod) with ≥96 GB VRAM
  • HF_TOKEN env var on the cloud instance
Beyond the ladder

20 more notebooks for every step of your SAE journey.

Post-train discovery, one-click steering, model coverage, research replication, safety monitoring — every notebook opens in Colab or Kaggle directly.

Closes the loop

You have an SAE. Now understand it, share it, edit it.
beginner

Discover your features

Auto-label your SAE with an LLM judge

You trained an SAE. Now what? This notebook streams activations, ranks features by interestingness, sends top-activating examples to Claude or GPT-4, and returns a feature_catalog.json with 1-sentence descriptions.

~20 min · Colab T4
Colab Free · ANTHROPIC_API_KEY or OPENAI_API_KEY
beginner

Build a shareable Trace

Your SAE + your prompt → trace.json + shareable URL

Generate a TraceData JSON (exact Trace Theater schema) for a custom prompt + SAE. Emits the same format /observatory/trace consumes. Upload to HF and share the URL.

~5 min · Colab T4
Colab Free
intermediate

Steer your model

Live feature intervention — baseline vs α ∈ {−3, 0, 1, 3}

Pick a feature, slide its activation coefficient, regenerate. Shows causal effect side-by-side. Q1 preview of the Q2 Sandbox. Exports interventions.json for inclusion in Trace Theater counterfactuals.

~3 min · Colab T4
Colab Free

Reduce friction

Pick the right tier before you spin up a GPU.
beginner

Pick your tier

VRAM calculator + layer recommender

Interactive "what tier should I train?" advisor. Auto-detects your GPU, asks for time budget, recommends a notebook + model + layer. Zero GPU required.

< 1 min · CPU fine
Anywhere

More models

Same recipe, different architectures.
intermediate

Llama-3.1-8B SAE

Tier 2 port — Llama-3.1-8B on Kaggle free

Train an SAE on the most popular open model. 100 M tokens on Kaggle 2× T4 in ~5–6 h, HF resumable checkpoints, standard .model.layers path.

5–6 h · Kaggle 2× T4
Kaggle Free · Meta license acceptance required
intermediate

Mistral-7B SAE

Tier 2 port — Mistral-7B-v0.3 on Kaggle free

A clean decoder-only model whose sliding-window attention is transparent to SAE training. Same Kaggle recipe as Llama; only the model is swapped. HF resumable checkpoints.

4–5 h · Kaggle 2× T4
Kaggle Free
beginner

Phi-3-mini SAE

Tier 1 alt — even faster hobbyist path

Microsoft Phi-3-mini (3.8B) fits comfortably on Colab free T4. 20-min training, Drive checkpoints, first-feature-discovery gift-wrapped.

~20 min · Colab T4
Colab Free

Research-grade

Replicate published results. Write your paper.
intermediate

Stage Gate G1 — correlation pre-test

ρ ≥ 0.30 or don't burn GPU on RL

Replicates the Stage Gate 1 protocol from mechreward. Computes Spearman ρ between your SAE feature pack and GSM8K correctness on 100 held-out samples. Pass/fail + scatter plot + report upload.

20–30 min · Colab T4
Colab · any tier
advanced

BatchTopK vs TopK

Replicate arxiv:2412.06410

Train TopK and BatchTopK on identical activation batches, compare Pareto (var_exp, L0, dead%). Shows where BatchTopK dominates and by how much.

~45 min · Colab T4
Colab Free
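The difference the notebook measures comes down to where the top-k cut is taken. A toy sketch of the two activation rules in PyTorch (an illustration, not the notebook's training code):

```python
import torch

torch.manual_seed(0)

def topk_act(pre, k):
    # Per-sample TopK: keep the k largest pre-activations in each row.
    vals, idx = pre.topk(k, dim=-1)
    return torch.zeros_like(pre).scatter(-1, idx, vals)

def batch_topk_act(pre, k):
    # BatchTopK: keep the batch_size * k largest pre-activations across
    # the whole batch, so per-sample L0 is allowed to vary.
    flat = pre.flatten()
    vals, idx = flat.topk(pre.shape[0] * k)
    out = torch.zeros_like(flat)
    out[idx] = vals
    return out.view_as(pre)

pre = torch.randn(8, 64)
a = topk_act(pre, k=4)        # exactly 4 live features per row
b = batch_topk_act(pre, k=4)  # 32 live features total, unevenly spread
```

The Pareto comparison in the notebook then asks whether that extra per-sample flexibility buys reconstruction quality at the same average L0.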

Circuits

Attribution graphs between SAE features. View with /observatory/circuits.
intermediate

Attribution Patching (AtP*)

Kramár 2024 — QK-fix + GradDrop · node attribution

Compute per-feature attribution scores on your SAE using AtP* (the 2-forward-1-backward linearization). Mean-ablation baseline, QK-fix for attention heads, GradDrop for sign-cancellation robustness. Emits circuit JSON for the Canvas viewer.

~15 min · Colab T4
Colab Free
advanced

Sparse Feature Circuits (Marks 2024)

arxiv:2403.19647 replication · node + edge DAG

Full replication of Marks et al. 2024. Node attribution via AtP + IG-10 fallback for early layers. Edge attribution via Appendix A.1 (upstream decoder × downstream encoder × upstream delta × downstream gradient). SAE error terms as triangle nodes.

~20 min · A100
Colab Pro A100
advanced

ACDC slow-mode via AutoCircuit

NeurIPS 2023 algorithm · independent verification

Run the original ACDC algorithm (Conmy 2023) using AutoCircuit (UFO-101 — the practitioner-default fork). Slower than AtP but peer-reviewed. Compare faithfulness curves across methods. Emits circuit.json compatible with the Canvas viewer.

1–2 h · Colab T4
Colab · any tier
advanced

Train a Sparse Crosscoder

Lindsey 2024 · shared dictionary across 3+ layers

Train a single crosscoder that reads and writes across multiple residual layers simultaneously. Unifies L11/L31/L55-style multi-layer SAEs into one feature index. Greenfield — not yet in SAELens. Classifies features as persistent / early-only / late-only / mixed.

~30 min · T4 (20M tok) · scales to paper-grade
Colab Free · T4

Leaderboard

Rank your SAE on the public InterpScore leaderboard.
intermediate

InterpScore v0.0.1 — rank your SAE

Composite metric · submit to the leaderboard

Compute the InterpScore of your SAE: loss_recovered + alive features + L0 sweet spot + sparse probing + TPP causal faithfulness. Emits interpscore.json, ready to PR into the public leaderboard at openinterp.org/interpscore.

~20 min · Colab T4
Colab Free · Gemma-2-2B default

Lenses

Classic tools — see what each layer is predicting.
beginner

Logit Lens — per-layer predictions

nostalgebraist 2020 · 5 lines of PyTorch

Apply final_ln + unembed to every intermediate residual stream. See what the model is "thinking" at each depth. Pure transformers — no TransformerLens dep. Handles Llama/Gemma/Qwen/GPT-2/multimodal paths.

~5 min · Colab T4
Colab Free
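The lens really is a handful of lines. A minimal sketch, using a randomly initialized LayerNorm and unembedding as stand-ins for the loaded model's own final_ln and unembed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab = 16, 50

# Stand-ins for the model's final norm + unembedding matrix.
final_ln = nn.LayerNorm(d_model)
unembed = nn.Linear(d_model, vocab, bias=False)

def logit_lens(resid):
    # The whole trick: decode every intermediate residual stream
    # as if it were the final one.
    return unembed(final_ln(resid))

# One fake residual stream per layer: (batch=1, seq=4, d_model).
resid_per_layer = [torch.randn(1, 4, d_model) for _ in range(5)]
preds = [logit_lens(r).argmax(-1) for r in resid_per_layer]  # token ids per depth
```

With a real model you would pull resid_per_layer from output_hidden_states or forward hooks, then detokenize preds to watch the prediction sharpen with depth.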
intermediate

Tuned Lens — calibrated predictions

Belrose 2023 · pretrained or fresh-fit

Per-layer affine transformation that fixes Logit Lens under-specification. Tries pretrained checkpoints first (GPT-2, Pythia, Llama-3-8B, OPT, Vicuna); falls back to 200-step fresh training on the Pile (~20 min on T4).

2 min (pretrained) · 20 min (fresh fit)
Colab Free

Probing

The supervised baselines that SAE features must beat.
beginner

Linear Probe — the SAE baseline

Alain & Bengio 2016 · the indispensable baseline

Fit sklearn LogisticRegression on residual-stream activations. Per-layer AUROC sweep. Diff-of-means baseline shipped (Farquhar 2023 critique). This is the number any SAE feature-pack must beat.

~10 min · Colab T4
Colab Free
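The baseline fits in a few lines of sklearn. A self-contained sketch on synthetic "activations" (a real run caches residual-stream vectors per layer instead; the planted direction and shift size are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64
# Synthetic activations: class 1 is shifted along a hidden direction,
# standing in for residual-stream vectors that encode the concept.
direction = rng.normal(size=d)
X0 = rng.normal(size=(200, d))
X1 = rng.normal(size=(200, d)) + 0.5 * direction
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
auroc = roc_auc_score(y, probe.predict_proba(X)[:, 1])

# Diff-of-means baseline: score = projection onto the class-mean difference.
dom_scores = X @ (X1.mean(0) - X0.mean(0))
auroc_dom = roc_auc_score(y, dom_scores)
```

The per-layer sweep in the notebook just repeats this fit once per layer and plots the AUROC curve.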
intermediate

CCS — Contrast Consistent Search

Burns 2022 · unsupervised truth-probing, with honest baselines

Replicates Burns et al. 2022 CCS on IMDB or TruthfulQA. Ships diff-of-means + supervised LR ceiling alongside CCS per Farquhar 2023 critique. Best-of-10 restarts. Honest verdict when CCS adds no value over diff-of-means.

~15 min · Colab T4
Colab Free
intermediate

RepE reading vector (LAT)

Zou 2023 · extract + monitor + steer a concept

Linear Artificial Tomography. 32 contrastive prompt pairs → PCA → first component is the "honesty" / "sycophancy" / "refusal" / "confidence" direction. Monitor new prompts. Confirmed causal via ±α steering at the end.

~10 min · Colab T4
Colab Free
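The LAT pipeline reduces to PCA on sign-randomized pair differences. A toy sketch with synthetic activations and a planted concept direction (magnitudes and names are illustrative, not the notebook's values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 32, 32
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Stand-in activations for 32 contrastive prompt pairs: the "positive"
# member of each pair is shifted along a shared hidden direction.
base = rng.normal(size=(n_pairs, d))
pos = base + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d))
neg = base + 0.1 * rng.normal(size=(n_pairs, d))

# Randomize pair order so the concept shows up as variance,
# not as a mean offset that centering would remove.
signs = rng.choice([-1.0, 1.0], size=(n_pairs, 1))
diffs = signs * (pos - neg)
diffs = diffs - diffs.mean(0)

# PCA via SVD: the first right-singular vector is the reading vector.
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
reading_vec = vt[0]
alignment = abs(float(reading_vec @ true_dir))  # near 1 when recovered
```

Monitoring is then a dot product of new activations with reading_vec, and steering adds ±α · reading_vec back into the residual stream.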

Safety + production

Q4 Watchtower preview.
intermediate

Watchtower preview — monitor input prompts

Detect anomalous feature activations in production traffic

Q1 preview of the Q4 Watchtower Enterprise API. Streams input prompts, measures watchlist feature activations, flags anomalies above threshold, emits dashboard-style report. Forward-only, no generation.

~5 min · any Colab
Colab · any tier
Side-by-side

Pick the tier that matches your compute.

              Hobbyist             Explorer             Paper-grade
Platform      Colab Free T4        Kaggle 2× T4         Cloud RTX 6000 Pro
Cost          $0                   $0 · 30 h/wk quota   ~$30–60 per run
VRAM          15 GB                32 GB                96 GB
Model         Gemma-2-2B (2.6 B)   Qwen3.5-4B (4.0 B)   Qwen3.6-27B (27 B)
Architecture  Dense                Hybrid GDN           Dense (reasoning)
Dictionary    n=16k (7×)           n=40k (16×)          n=65k (13×)
TopK          k=64                 k=128                k=128 + AuxK
Tokens        50 M                 150 M                200 M
Time          30–40 min            4–5 h                20–24 h
What you get  First SAE            Hybrid-arch SAE      Paper-grade SAE
The shared recipe

Every tier uses the same research-grade recipe.

The protocol scales. Only the hyperparameters change.

01

Stream activations

Hook the model's residual stream at a mid-layer. Stream a web-scale corpus (FineWeb-Edu, OpenThoughts, custom). Pack into batches, emit activation vectors.
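The hook pattern behind this step, sketched on a toy block stack standing in for model.model.layers (the notebooks register the same kind of hook on the real HF model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer's residual stream: a stack of blocks.
d_model = 32
blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(8)])

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so cached activations don't keep the autograd graph alive.
        captured[name] = output.detach()
    return hook

# Hook a mid-layer, the same way the notebooks hook model.model.layers[i].
handle = blocks[4].register_forward_hook(make_hook("resid_L4"))

x = torch.randn(16, d_model)  # a "batch of token activations"
for block in blocks:
    x = block(x)

handle.remove()
acts = captured["resid_L4"]   # (16, d_model) activation vectors for the SAE
```

In the real loop these captured vectors are packed into a shuffled buffer and streamed to the SAE trainer batch by batch.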

02

TopK SAE + AuxK

Hard TopK activation (Gao et al. 2024). AuxK auxiliary loss revives dead features. Geometric-median b_dec init. Decoder columns re-normed every step.
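A minimal sketch of this step in PyTorch. The zero-init b_dec and the 1/32 AuxK coefficient are simplifications; the notebooks use geometric-median initialization and their own tuned coefficients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSAE(nn.Module):
    # Minimal TopK + AuxK sketch (Gao et al. 2024 style).
    def __init__(self, d_in, d_sae, k, k_aux):
        super().__init__()
        self.k, self.k_aux = k, k_aux
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_in))  # geometric-median in the notebooks

    def forward(self, x, dead_mask=None):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        vals, idx = pre.topk(self.k, dim=-1)                    # hard TopK
        acts = torch.zeros_like(pre).scatter(-1, idx, vals)
        recon = acts @ self.W_dec + self.b_dec
        loss = F.mse_loss(recon, x)
        if dead_mask is not None and bool(dead_mask.any()):
            # AuxK: reconstruct the residual error using only the top-k_aux
            # *dead* features, so they keep receiving gradient.
            dead_pre = pre.masked_fill(~dead_mask, -torch.inf)
            k_aux = min(self.k_aux, int(dead_mask.sum()))
            dvals, didx = dead_pre.topk(k_aux, dim=-1)
            dacts = torch.zeros_like(pre).scatter(-1, didx, dvals)
            loss = loss + (1 / 32) * F.mse_loss(dacts @ self.W_dec,
                                                (x - recon).detach())
        return recon, acts, loss

torch.manual_seed(0)
sae = TopKSAE(d_in=32, d_sae=256, k=8, k_aux=16)
x = torch.randn(4, 32)
recon, acts, loss = sae(x, dead_mask=torch.rand(256) > 0.5)
```

The per-step decoder re-norm mentioned above is one extra line after the optimizer step: normalize each row of W_dec to unit norm under torch.no_grad().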

03

Resume-safe checkpoint

HuggingFace streaming checkpoints every 5–10M tokens. If Colab idles, Kaggle kills the kernel, or the cloud instance crashes — you lose at most 10 minutes.
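The resume logic, sketched with a local torch.save round-trip (the notebooks stream the same payload to the HuggingFace Hub instead of local disk):

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_ckpt(path, model, opt, tokens_seen):
    torch.save({"model": model.state_dict(),
                "opt": opt.state_dict(),
                "tokens_seen": tokens_seen}, path)

def load_ckpt(path, model, opt):
    # weights_only=False: this is our own trusted checkpoint file.
    ckpt = torch.load(path, weights_only=False)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    return ckpt["tokens_seen"]

model = nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters())
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")

save_ckpt(path, model, opt, tokens_seen=10_000_000)

# Simulate a kernel kill: rebuild everything from scratch, then resume.
model2 = nn.Linear(8, 8)
opt2 = torch.optim.Adam(model2.parameters())
resume_tokens = load_ckpt(path, model2, opt2)
```

Saving optimizer state alongside the weights matters: resuming Adam without its moment estimates causes a visible loss spike.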

04

Cosine LR + warmup

1k–5k-step warmup to a 2e-4 peak, cosine decay to a 6e-5 floor. The non-zero floor keeps dead-feature revival active throughout training.
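The schedule as a pure function (warmup length and step counts here are illustrative; each tier tunes its own):

```python
import math

def lr_at(step, total_steps, warmup_steps=1_000, peak=2e-4, floor=6e-5):
    # Linear warmup to the peak, then cosine decay to a non-zero floor
    # so AuxK dead-feature revival stays active to the end of training.
    if step < warmup_steps:
        return peak * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

mid_warmup = lr_at(500, 100_000)   # halfway up the warmup ramp
final = lr_at(100_000, 100_000)    # decayed all the way to the floor
```

Drop it into training as `for g in opt.param_groups: g["lr"] = lr_at(step, total_steps)` each step.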

05

Held-out validation

Final step: compute var_expl, L0, and dead-fraction on 500k–1M fresh tokens (different seed). Upload val_report.json — your SAE ships with receipts.
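The three validation numbers, sketched as one function (var_expl here is measured against mean-centered activation variance; the notebooks' exact definition may differ in detail):

```python
import torch

def val_metrics(x, recon, acts, eps=1e-8):
    # var_expl: 1 - residual power / (mean-centered) activation variance.
    var_expl = 1 - (x - recon).pow(2).sum() / ((x - x.mean(0)).pow(2).sum() + eps)
    l0 = (acts != 0).float().sum(-1).mean()               # live features per token
    dead_frac = ((acts != 0).sum(0) == 0).float().mean()  # features that never fired
    return var_expl.item(), l0.item(), dead_frac.item()

torch.manual_seed(0)
x = torch.randn(1000, 32)
acts = torch.zeros(1000, 256)
acts[:, :64] = torch.rand(1000, 64) + 0.1  # pretend 64 of 256 features ever fire
var_expl, l0, dead = val_metrics(x, x.clone(), acts)   # perfect recon toy case
```

Run it on 500k–1M fresh tokens from a different seed than training, dump the three numbers to val_report.json, and upload it next to the weights.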

06

SAELens-compatible export

safetensors with W_enc, W_dec, b_enc, b_dec + cfg.json with architecture, d_in, d_sae, k, hook_name. Load directly in SAELens. Ready for Neuronpedia ingestion.

Stuck? Lost? Want your notebook added?

Open an issue on GitHub, email us, or propose your own tier-specific recipe. We review every PR — unusual architectures (Mamba, RWKV, diffusion-LM) especially welcome.