Research

Papers, posts, and roadmap.

Every artifact is open. Negative results are publishable. Broken links, stale numbers, or methodological bugs get flagged and fixed in public.

Open artifacts

SDKs, probe weights, SAE / crosscoder models, and reproducible harnesses. Apache-2.0 throughout. Every artifact is paired with a paper or eval doc that documents how it was built and what it's for.

agent-probe-guard SDK v0.3.1

Live · Apache-2.0

PyPI · openinterp · HF dataset · GitHub release

Two-probe activation gate for code agents on Qwen3.6-27B (capability + thinking-intent). Skip / escalate / proceed routing, ~50ms gate, refit() helper for cross-env transfer. Detect-only by design — confirmed across 3 intervention experiments.

agent-probe-guard probe weights

Live · Apache-2.0

HF dataset · caiovicentino1/agent-probe-guard-qwen36-27b

L43 K=10 capability + L55 K=5 thinking probes for Qwen3.6-27B. CV AUROC 0.83 / 0.85 with random K-matched gap +0.08 / +0.15. Cross-env transfer caveat documented in paper Appendix C.

NLA two-tier verbalization — three-model dataset

Live · Apache-2.0

HF dataset · caiovicentino1/openinterp-paper7-nla-two-tier-verbalization

Reproducibility artifacts for the "Reconstruction Without Recall" paper — Phase 16 result JSONs (N=150 per model) + AV explanations across V1 Qwen2.5-7B-L20, V2 Gemma-3-12B-L32, V3 Gemma-3-27B-L41 NLA pairs from the kitft release. Documents the three differential scaling axes and format-prior contraction.

Probe-detected grokking — DPO sweep dataset

Live · Apache-2.0

HF dataset · caiovicentino1/openinterp-41v2-grokking-extended

Forward-only sweep across 11 multi-probe DPO checkpoints on Qwen3.6-27B. Fresh-probe AUROC trajectory 0.472 → 0.528 with late/early slope ratio 2.60 — the grokking signature orthogonal to original FG/RG axes. Companion to the multi-probe DPO checkpoints (caiovicentino1/openinterp-37v2-multiprobe-dpo-extended).

Qwen3.6-27B paper-grade SAE

Live · Apache-2.0

HF model · caiovicentino1/qwen36-27b-sae-papergrade

Multi-layer SAE on Qwen3.6-27B trained on 200M tokens. Variance explained (ve): L11 0.842 / L31 0.706 / L55 0.808. The only public SAE for the Qwen3.6 reasoning model as of May 2026. Used by FabricationGuard, ReasonGuard, agent-probe-guard.

Qwen3.6-27B full-stack SAE (PGAC)

Live · Apache-2.0

HF model · caiovicentino1/qwen36-27b-sae-fullstack

Eleven-layer TopK SAE stack covering every 5th layer of Qwen3.6-27B for Probe-Gated Adaptive Compute (PGAC). d_sae 40,960, k=128, 200M tokens. Substrate for the 3.57× compound inference speedup at iso-SAE-floor result.
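As a reading aid for the TopK figures above (d_sae 40,960, k=128), here is a minimal sketch of a TopK activation in plain Python. The function name `topk_activation` is hypothetical, and whether the released SAEs apply a ReLU before the top-k step or break ties differently is not stated here, so treat this as an illustration of the general technique rather than the release's implementation.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and set every other entry to zero."""
    if k >= len(pre_acts):
        return list(pre_acts)
    # Indices of the k largest values (ties resolved by position, an assumption).
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


# Toy latent vector, not taken from any released model.
sparse = topk_activation([0.2, 1.5, -0.3, 0.9, 0.1], k=2)

# Fraction of latents active per token at the sizes quoted above.
active_fraction = 128 / 40960
```

With k=128 out of 40,960 latents, one in 320 latents is active per token.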

Gemma-2-2B base/IT crosscoder (paper-grade)

Live · Apache-2.0

HF model · caiovicentino1/gemma2-2b-crosscoder-model-diff-papergrade

BatchTopK crosscoder, 73,728 latents, k=100, expansion 32, layer 13. Trained on 100M tokens (FineWeb-Edu + UltraChat-200k). VE_A 0.877 / VE_B 0.867. Substrate for the Cosine-Causal Gap paper.

FabricationGuard linear probe

Live · Apache-2.0

HF dataset · caiovicentino1/FabricationGuard-linearprobe-qwen36-27b

Linear probe at L31 on Qwen3.6-27B residual stream. HaluEval within 0.90, SimpleQA cross 0.88, −88% confidently-wrong reduction in SimpleQA. Powers the FabricationGuard product and the multi-probe DPO joint reward.

SWE-bench Pro reproducible harness

Live · Apache-2.0

GitHub · OpenInterpretability/openinterp-swebench-harness

V1 harness for collecting agent rollouts on SWE-bench Pro with residual-stream captures (transformers direct, forward hooks). Phase 1 N=20 + Phase 6 N=99 in flight. Substrate for the Two Forms of Epiphenomenal Probes paper.

Plus ~60 SAEs, probes, datasets, and intermediate artifacts on Hugging Face: every published paper's reproducibility data, training checkpoints, and probe weights.

Browse all on Hugging Face · GitHub org

Our papers

26 hosted on site · full text

Drafts, eval docs, and submitted papers authored by OpenInterp. Markdown source mirrored from the research repos so you can read the exact text without leaving the site.

Explore-Consolidate Dynamics in Cross-Probe Coherence

draft

U-Shape Trajectories of κ_t Separate Successful and Failed LLM Agent Runs on SWE-bench Pro / Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft v2) → ICLR 2027 main · 2026-05-18

We propose cross-probe coherence κ_t — the mean absolute pairwise correlation across N concurrent per-turn behavioral probes within a moving window of agent turns — as a meta-signal for LLM agent monitoring. On 99 SWE-bench Pro trajectories from Qwen3.6-27B we report two distinct findings. (1) Per-trace mean κ̄ separates success from failure at AUROC 0.677 (Mann-Whitney p=0.0009). (2) κ_t exhibits a U-shape over each trajectory: it decreases through an early exploration phase and increases through a late consolidation phase, and the amplitude of this U-shape is markedly larger in successful traces (early-half slope −0.0078/turn vs −0.0007/turn, p=0.0002; late-half slope +0.0149/turn vs +0.0025/turn, p=0.00004). A pre-registered robustness control (C1) found the monolithic per-trace slope is substantially explained by trace-length confound (p=0.56 after length regression), motivating the length-normalized early-half/late-half decomposition. Within-trace turn-order shuffle nulls (C2) confirm the U-shape is genuinely temporal (p<0.0001). The pattern is the inverse of cardiac uncoupling: in ICU literature, cross-vital decorrelation anticipates physiological decompensation; in LLM agents, cross-probe trajectories oscillate during successful reasoning (explore→consolidate) and remain flat during failure. We document the methodological discipline — pre-registered gates that caught both five prior single-probe walk-backs in the 36 hours before this finding and this paper's own headline-confound — that gives the rescued temporal claim its credibility.

cross-probe correlation · meta-signal · LLM agent monitoring · SWE-bench Pro · Qwen3.6-27B
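The abstract above defines cross-probe coherence κ_t as the mean absolute pairwise correlation across concurrent per-turn probes within a moving window of agent turns. The sketch below works through that definition in plain Python. The names `pearson` and `kappa_t`, the trailing-window convention, and the choice to treat a constant window as zero correlation are assumptions for illustration; the paper's window size and edge handling are not given here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def kappa_t(probe_scores, window):
    """probe_scores[i][t] is probe i's score at turn t.

    Returns one kappa value per trailing window of `window` turns:
    the mean of |corr| over all probe pairs inside that window.
    """
    n_turns = len(probe_scores[0])
    pairs = list(combinations(range(len(probe_scores)), 2))
    out = []
    for end in range(window, n_turns + 1):
        segs = [p[end - window:end] for p in probe_scores]
        total = sum(abs(pearson(segs[i], segs[j])) for i, j in pairs)
        out.append(total / len(pairs))
    return out


# Toy scores for three probes over six turns (not data from the paper).
scores = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12],
    [6, 5, 4, 3, 2, 1],
]
kappas = kappa_t(scores, window=3)
```

In this toy case every pair is perfectly correlated or anti-correlated, so every window has κ = 1; the absolute value is what makes anti-correlated probes count as coherent.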

Conditionally-Causal Probes: Five Operational Constraints on Linear-Probe Causality in Qwen3.6-27B

draft

An eleven-site empirical map, a unifying operational-constraints framework, and a pre-publication diagnostic battery — derived from four prior honest negatives

Caio Vicentino · TMLR (Survey Certification) → ICLR 2027 main (draft v1) · 2026-05-16

Linear probes on transformer residual streams routinely achieve high predictive AUROC, yet whether a probe direction also levers downstream behavior under intervention is rarely measured systematically. We report a twelve-site causal-authority map of probes in Qwen3.6-27B (reasoning-tuned, 27B parameters), comprising eleven probes evaluated under a unified α-sweep + control-token + onset-timing protocol plus one predictive case study, and identify five distinct empirical causal regimes: causal trajectory-shaping, pushup-asymmetric, pushdown-asymmetric, structurally-locked, and epiphenomenal-via-softmax-temperature. We propose that probe causality is operationally constrained by a five-axis configuration — layer (spatial), trajectory (temporal), magnitude (α), direction (saturation alignment), and decision locus (architectural) — and demonstrate each constraint with a within-paper falsifying experiment that holds the other four fixed. We then consolidate the methodology that surfaced these constraints into a six-item pre-publication diagnostic battery: random-feature baseline, shuffled-source baseline, control-token normalization, structural-rigidity α-sweep, whitespace-stripped flip metric, and onset-timing sweep. Each diagnostic is mapped to a concrete failure mode we shipped or nearly shipped in our own work: over-parameterization at N<100, marginal-fit pathology in sparse top-k prediction, softmax-temperature artifacts that look causal, amplitude-null masquerading as structural-null, tokenization-inflated flip rates, and trajectory-versus-state confusion. Together the diagnostics cost under one GPU-hour per probe. 
We release the protocol, capture batches, per-probe verdicts, and an open-source SDK that implements the diagnostics, and argue that the field's growing reliance on probe-based monitoring, reward shaping, and alignment auditing should treat probe causality as a conditional property to be measured per deployment configuration, not a global per-probe attribute.

meta-paper · probe causality taxonomy · operational constraints · pre-publication diagnostics · Qwen3.6-27B

Trajectory-Shaping Probe Steering in Qwen3.6-27B Reasoning

draft

Causal, Cross-Domain, and KV-Cache-Bound — a Subjective-Time Direction with Operational Constraints

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft v2.1) · 2026-05-16

We identify a causally functional subjective-time direction in the residual stream of Qwen3.6-27B (open-weights, 27B parameters, hybrid Gated-Delta-Net plus standard attention), validate it across math (GSM8K) and code (SWE-bench Verified) reasoning, and characterize a fundamental operational constraint on its causal effect: the steering intervention works only when applied continuously from generation start. A Ridge regression probe trained on residuals at L11/L31/L55 predicts thinking-phase completion with R²=0.82-0.86 (Spearman ρ ≥ 0.90); three baselines (random-feature, shuffled-target, constant-mean) cleanly fail. Forward-hook steering at L31 with α=+50 from token 1 shortens GSM8K thinking-length in 9/14 prompts vs 2/14 for matched random (Fisher p=0.0092). Cross-domain: 19/20 (95%) probe-clean-termination on SWE-bench Verified across 6 repositories vs 6/20 (30%) random (Fisher p<0.001), at mean 299 thinking-tokens vs unbounded baselines (0/10 terminate even at MAX_NEW_TOK=2048). The mechanism is trajectory-dependent: delayed steering — even at decode step 50 — drops termination from 9/10 to 3/10; by step 200, the rescue effect vanishes entirely (0/10). Two closed-loop variants (probe-as-sensor with threshold trigger + plateau detector) achieve only 1-2/10 termination, confirming that the 'termination basin' is mediated through KV-cache state buildup rather than instantaneous residual perturbation. Phase 2C cross-layer test further establishes that the direction is causal ONLY at L31 — L11 (R²=0.84) and L55 (R²=0.82) are inert despite equivalent probe accuracy. This refines the probe-causality taxonomy with a third category beyond 'causal' / 'epiphenomenal': operationally-constrained causal — directions that lever behavior only under specific application protocols (temporal: from token 1; spatial: specific layer only).

probe steering · trajectory-dependent · KV-cache · subjective time · Qwen3.6-27B
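The Fisher p-values quoted in the abstract above can be checked from the reported counts alone. The sketch below computes a one-sided Fisher exact test with the standard library. The abstract does not state which sidedness was used; a one-sided test is assumed here because it reproduces the reported 0.0092. The function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Probability, with all margins fixed, of seeing at least `a`
    successes in the first row (hypergeometric upper tail).
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, row1)


# GSM8K: 9/14 shortened under probe steering vs 2/14 under matched random.
p_gsm8k = fisher_one_sided(9, 5, 2, 12)    # about 0.0092

# SWE-bench Verified: 19/20 clean terminations vs 6/20 for random.
p_swebench = fisher_one_sided(19, 1, 6, 14)  # about 2e-5, below 0.001
```

Both values agree with the figures in the abstract under the one-sided assumption.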

The Marginal-Fit Pathology in Predictive SAE Feature Trajectory Probes

draft

An Honest-Negative on Predicting End-of-Thinking SAE Features in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-16

We train linear probes to predict end-of-thinking sparse-autoencoder (SAE) features in Qwen3.6-27B from residual activations at earlier thinking-phase fractions, across three layers (L11, L31, L55). Naive evaluation reports recall@1024 = 0.83-0.87 at L11/L31 and 0.67-0.72 at L55. We then run a shuffled-source baseline (X_train shuffled, y_train kept, identical recipe) and observe that the baseline reproduces the real recall within ±0.03 at all 12 (layer × source-fraction) sites, with Cohen's d < 0.15. A trivial constant baseline that predicts the top-M most-globally-common features ignoring input strictly exceeds the trained probe (1.000 at L11/L31, 0.991 at L55). The probe is not learning per-prompt predictive structure — it is fitting the marginal distribution of end-of-thinking SAE features and approximating an input-independent constant rule. We name this the marginal-fit pathology, identify five structural conditions that produce it (sparse top-k target + concentrated marginal + N_train << d_target + lazy loss + recall-style metric), contribute the shuffled-source baseline as a Phase 6c-class hard rule for sparse-target probe-prediction work, and reframe predictive-probe agendas — including JEPA-shaped LLM experiments — toward differential metrics (REAL − SHUFFLED) from day one rather than absolute recall.

sparse autoencoders · predictive probes · honest negative · marginal-fit pathology · shuffled-source baseline
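To make the constant baseline and the differential metric from the abstract above concrete, here is a small sketch in plain Python. It assumes recall is the fraction of a prompt's target feature ids that appear in the predicted set; the paper's exact recall@1024 definition may differ. All function names are hypothetical and the feature ids are toy values.

```python
from collections import Counter


def recall_at_m(predicted, target):
    """Fraction of target feature ids that appear in the predicted set."""
    target = set(target)
    return len(target & set(predicted)) / len(target)


def constant_top_m(train_targets, m):
    """Input-independent rule: the m feature ids most frequent in training targets."""
    counts = Counter(f for t in train_targets for f in t)
    return [f for f, _ in counts.most_common(m)]


def differential(real_recalls, shuffled_recalls):
    """Mean(REAL) minus mean(SHUFFLED); near zero means no per-input structure."""
    real = sum(real_recalls) / len(real_recalls)
    shuffled = sum(shuffled_recalls) / len(shuffled_recalls)
    return real - shuffled


# Toy training targets with a concentrated marginal: ids 1 and 2 are always active.
train = [[1, 2, 3], [1, 2, 4], [1, 2, 5]]
baseline = constant_top_m(train, m=2)

# The constant rule recovers the common features of an unseen target without any input.
toy_recall = recall_at_m(baseline, [1, 2, 6])
```

When the target marginal is this concentrated, a high absolute recall says little about the probe, which is why the abstract argues for reporting REAL minus SHUFFLED instead.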

Reconstruction Without Recall

draft

Two-Tier Verbalization in Natural Language Autoencoders — Three-Model Differential Scaling on Qwen2.5-7B, Gemma-3-12B, and Gemma-3-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-09

Natural Language Autoencoders (Fraser-Taliente et al. 2026) train an activation-verbalizer (AV) and an activation-reconstructor (AR) end-to-end with GRPO so that round-trip MSE between original and AR-reconstructed activations serves as a learnable explanation-quality reward. We replicate the canonical recipe on three NLA pairs from the kitft release spanning two model families and three scales — kitft/nla-qwen2.5-7b-L20, kitft/nla-gemma3-12b-L32, and kitft/nla-gemma3-27b-L41 — and show that the headline metric, fve_nrm, decouples from semantic content fidelity across all three models, with three differential scaling axes that sharpen the methodological position. On a 50-prompt corpus across 4 categories (chat / code / agent / reasoning) at K=3 samples (N=150 per model), fve_nrm is uniform at high absolute level (Qwen 0.880 / Gemma-12B 0.992 / Gemma-27B 0.982; spreads 0.017 / 0.005 / 0.010) while keyword recall varies 6.5–8.8× across categories (Qwen spread 0.490 → Gemma-12B 0.649 → Gemma-27B 0.654). The trajectory reveals: (a) overall content-fidelity signal-above-floor grows monotonically with NLA training quality (permutation gap +0.27 → +0.38 → +0.43, no ceiling visible); (b) per-category recall spread saturates between 12B and 27B at a training-distribution-imbalance ceiling; (c) Tier 1 fve_nrm peaks at moderate model size then slight regression at 27B, suggesting layer-extraction-dependent quality. Three controls validate on all three: permutation, random Gaussian (collapse to fve_nrm = −0.949 → −0.992 → −1.000 with exact orthogonal cosine, while AV produces increasingly contracted format templates from heterogeneous formats to single 'Educational article' hyper-template attractor), and direction-injection (4/4, 3/4, 3/4 self-category alignment with negation symmetry — agent failure mode model-specific: Gemma-12B agent → code, Gemma-27B agent → chat under format-prior contraction). 
Two-tier thesis: Tier 1 (format/category) is direction-modulated and what fve_nrm measures; Tier 2 (specific content) is largely unencoded. Better NLA training makes fve_nrm less, not more, informative about per-category Tier 2 quality.

natural language autoencoders · activation decoding · interpretability evaluation · format priors · decoupling magnification

Activation-Bounded Chain-of-Thought Monitorability

draft

Template-locked reasoning decisions and the structural ceiling on text-only CoT monitoring in Qwen3.6-27B

Caio Vicentino · Position paper, May 2026 · 2026-05-09

Chain-of-thought monitoring has emerged as a leading candidate for scalable AI safety oversight: if a long, serial reasoning process must pass through a textual trace, then reading that trace should reveal what the model is thinking. We argue this view is structurally incomplete in instruction-tuned reasoning models. The chat template that activates thinking mode in Qwen3.6-27B injects a fixed <think></think> token pair into the input itself; the decision to think at all is encoded in the prompt before the residual stream encodes anything else. We document this template-lock experimentally (bidirectional α-sweep up to ±200 in the L55 thinking probe direction produces zero behavioral change) and contrast it with three other reasoning loci where decisions are residual-stream-encoded and steerable: capability deployment (+33-40pp pushdown gap across distributions at α=−100), persona (+60pp pushdown at α=−200), and mid-reasoning quality (+30pp pushup at α=+200). We use this evidence to argue for an activation-derived monitorability bound: text-only CoT monitoring cannot, by construction, observe decisions made before the residual stream encodes them. Activation-derived monitoring is the structurally complementary half of any complete monitoring strategy. We discuss implications for the Frontier Model Forum's January 2026 issue brief and Anthropic's 2027 detection goal.

chain-of-thought · monitorability · AI safety · activation steering · template-lock

Saturation-Direction Lever

draft

A Five-Class Taxonomy of Probe Causality in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-09

Linear probes routinely achieve high predictive AUROC, yet their causal authority — whether their direction levers downstream behavior — has been only sparsely tested at frontier scale. We map probe causality across 8 probes (5 layers, 5 positions, 3 training-objective classes) on Qwen3.6-27B using a unified protocol combining bidirectional α-sweep up to α=±200, random K-matched control direction, control-token-normalized log-prob shifts, structural-rigidity diagnostic, and whitespace-stripped behavioral flip metric. We document five empirical classes of probe-causality regime and identify a single unifying principle — probes lever in the saturation direction of the baseline residual — that explains all observed asymmetric-lever cases including a falsified prediction. The classes are: (1) surface softmax-temperature artifact (L43 capability), (2) template-locked categorical decision (L55 thinking emission, L31 fabrication-detection), (3) structural fragility at fragile layers (L11/L43 think_start), (4a) pushup-asymmetric lever for reasoning quality at high amplitude (RG L55 mid_think, +30pp gap), and (4b) pushdown-asymmetric lever for capability and persona at high amplitude (5 sites, +30 to +60pp gap). We falsify the naive prediction that continuous-gradient probes lever in the pushup direction by demonstrating that persona — a continuous gradient — levers in the pushdown direction when the test prompt's baseline is in the helpful saturation region.

linear probes · causal interpretability · saturation direction · asymmetric lever · Qwen3.6-27B

The Cosine–Causal Gap in Cross-Model Crosscoders

draft

When Decoder Universality Overstates Causal Equivalence in Gemma-2-2B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-08

We provide the first per-feature empirical measurement of the gap between decoder cosine universality and causal-equivalence in cross-model crosscoders. Training a paper-grade BatchTopK crosscoder (73,728 latents, k=100) on Gemma-2-2B base/IT at layer 13, we measure Pearson correlation between two-model KL trajectories under per-feature ablation across 256 probes. Median decoder cosine 0.965 vs median Pearson_CE 0.616, with 38.24% of shared features having cosine > 0.7 yet CE < 0.5. Outliers in both tails — anti-aligned decoders with equivalent causal effect, and aligned decoders with opposite causal effect — show that cosine is neither necessary nor sufficient for causal equivalence. We propose Pearson_CE as a mandatory complementary diagnostic for crosscoder universality claims and release all artifacts under Apache-2.0.

crosscoders · SAE · universality · causal equivalence · Gemma-2
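The two quantities contrasted in the abstract above can be sketched side by side: decoder cosine between a shared feature's two decoder vectors, and the Pearson correlation between the two models' KL values under ablation of that feature (Pearson_CE). The thresholds 0.7 and 0.5 come from the abstract; the function names, the per-probe KL lists, and the toy numbers are assumptions for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def in_cosine_causal_gap(dec_a, dec_b, kl_a, kl_b, cos_min=0.7, ce_max=0.5):
    """True when decoders look aligned but ablation effects do not co-vary."""
    return cosine(dec_a, dec_b) > cos_min and pearson(kl_a, kl_b) < ce_max


# Toy feature: nearly identical decoder directions, uncorrelated ablation effects.
dec_base, dec_it = [1.0, 0.0], [1.0, 0.1]
kl_base = [0.1, 0.2, 0.3, 0.4]  # KL after ablation in model A, one value per probe
kl_it = [0.4, 0.1, 0.3, 0.2]    # KL after ablation in model B, same probes
flagged = in_cosine_causal_gap(dec_base, dec_it, kl_base, kl_it)
```

Here the cosine is about 0.995 while the ablation effects correlate at about -0.4, so the toy feature lands in the gap.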

Probe-Detected Grokking in Multi-Probe DPO

draft

Orthogonal Learning Beyond Task-Specific Detectors in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-08

We report a phase-transition-like dynamic in multi-probe Direct Preference Optimization on a 27B-parameter reasoning model, observable only through fresh probes trained after the fact. Original probes (FabricationGuard at L31, ReasoningGuard at L55) used as the joint preference signal remain invariant across training (variance 7×10⁻⁸, ~40× below within-step noise) despite a 0.234 DPO loss descent and 0.654 logit-difference. A fresh probe re-trained on each checkpoint reveals a smooth, accelerating progression in AUROC from 0.472 → 0.528 with a late-half-to-early-half slope ratio of 2.60 — the construct-then-compress signature of grokking dynamics, but with a compression target orthogonal to the original probes. We argue this is a structural Goodhart phenomenon specific to probe-derived rewards, propose fresh-probe AUROC progression as a complementary safety evaluation, and release training checkpoints, probes, and reproducer code under Apache-2.0.

grokking · DPO · Goodhart · probes · Qwen3.6-27B
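The late-half-to-early-half slope ratio mentioned in the abstract above can be sketched as two least-squares fits over the checkpoint sequence. How the paper splits its 11 checkpoints into halves is not stated here; this sketch assumes the two halves share the middle checkpoint. The function names are hypothetical and the trajectory is a synthetic accelerating curve, not the paper's AUROC values.

```python
def ols_slope(ys):
    """Least-squares slope of ys against evenly spaced steps 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def late_early_slope_ratio(ys):
    """Slope over the late half divided by slope over the early half.

    Assumption: both halves include the middle point.
    """
    mid = len(ys) // 2
    return ols_slope(ys[mid:]) / ols_slope(ys[: mid + 1])


# Synthetic accelerating trajectory over 11 checkpoints (y = t^2).
toy_trajectory = [t * t for t in range(11)]
ratio = late_early_slope_ratio(toy_trajectory)
```

A ratio above 1 means progress accelerates in the second half; a linear trajectory would give exactly 1.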

Two Forms of Epiphenomenal Probes in Code Agents

draft

Mid-Reasoning Capability and Chain-of-Thought Emission in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-08

We train linear probes on residual-stream activations of Qwen3.6-27B and obtain two correlative findings (capability AUROC 0.83 at L43 pre_tool, CoT emission AUROC 0.85 at L55). Three intervention experiments show both are detection-only via two distinct mechanisms: softmax-temperature artifact (L43) and template-locked decision (L55). We contribute three sanity checks (random-K baseline, control-token normalization, structural-rigidity α-sweep) and ship agent-probe-guard, an Apache-2.0 SDK that markets the correlative finding without overclaiming. Cross-environment eval reveals probe weights are coupled to inference setup; v0.3.1 ships a refit() helper.

Our papers

Explore-Consolidate Dynamics in Cross-Probe Coherence

Conditionally-Causal Probes: Five Operational Constraints on Linear-Probe Causality in Qwen3.6-27B

Trajectory-Shaping Probe Steering in Qwen3.6-27B Reasoning

The Marginal-Fit Pathology in Predictive SAE Feature Trajectory Probes

Reconstruction Without Recall

Activation-Bounded Chain-of-Thought Monitorability

Saturation-Direction Lever

The Cosine–Causal Gap in Cross-Model Crosscoders

Probe-Detected Grokking in Multi-Probe DPO

Two Forms of Epiphenomenal Probes in Code Agents

Pre-flight Probe Eval v6 — Phase 8 Template-Lock Verdict

Six Diagnostics, Six Walk-Backs

Tool-Entropy Collapse: A Cross-Architecture Signature of Agent WANDERING Failure

Causal Localization of Agent WANDERING to Edge-Layer L11: The Right Locus Is Still Not a Rescue Lever

Multi-Channel Mechanistic Signatures of Agent WANDERING: Classification, Causal Localization, and Behavior-Legible Response to Intervention

Modality Matters: A Transient Behavioral Interruption Rescues Agent WANDERING Where Residual Steering Does Not

No Better Than Behavioral: A Residual Velocity-Freezing Fingerprint Predicts Agent WANDERING No Better Than the Cheap Tool-Entropy Detector

The Verdict Is Not the Lever: An Interpretable Task-Completion Feature Predicts but Does Not Cause Long-Horizon Agent Termination

The Lever Is Late: Causal Control of Long-Horizon Agent Termination Lives in a Task-Matched, Late Action-Commitment Block

The Lever Generalizes — and It Brakes: A Late, Bidirectional Action-Commitment Lever Across Agent Decisions and Architectures

Mechanistic Circuit-Breakers Generalize Across Irreversible Agent Actions and Architectures

The Authorization Direction: A Late-Layer Direction that Detects and Controls an Agent's Commitment to Unauthorized Irreversible Actions, Across Architectures

Felt, Not Granted: An Internal Authorization Probe Inherits the Agent's Judgment Error and Is Operationally Blind to Realistic Model-Origin Over-Reach, Where an External Task-Grounded Check Is Not

The Late Channel: Chain-of-Thought Becomes Causal and Decodable Only Late in a 27B Reasoning Agent

Located, Not Secured: Principled Limits of Interpretability-Based Control over Agent Actions

The Criterion Cannot See What It Does Not Measure: Auditing Capability-Guided Attention Hybridization Against a Named Agent-Commitment Circuit

Roadmap

Further reading

SAE foundations & scaling

Foundations

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Interim Report: Taking Features Out of Superposition

Scaling SAEs

Scaling and Evaluating Sparse Autoencoders

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Gemma Scope: Open SAEs Everywhere All At Once on Gemma 2

Architecture variants

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU SAEs

Improving Dictionary Learning with Gated Sparse Autoencoders

BatchTopK Sparse Autoencoders

ProLU: A Nonlinearity for Sparse Autoencoders

Transcoders & cross-coders

Transcoders Find Interpretable LLM Feature Circuits

Sparse Crosscoders for Cross-Layer Features and Model Diffing

Circuit Tracing: Revealing Computational Graphs in Language Models

Evaluation & benchmarks

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

A is for Absorption: Studying Feature Splitting and Absorption in SAEs

Training tricks & auto-interp

Open Source Automated Interpretability for SAE Features

Efficient Dictionary Learning with Switch Sparse Autoencoders

Circuits & attribution

Early circuit analysis

A Mathematical Framework for Transformer Circuits

In-context Learning and Induction Heads

Interpretability in the Wild: A Circuit for IOI in GPT-2 Small

Progress Measures for Grokking via Mechanistic Interpretability

Attribution patching

Attribution Patching: Activation Patching At Industrial Scale

AtP*: An Efficient and Scalable Method for Localizing LLM Behaviour

Attribution Patching Outperforms Automated Circuit Discovery

Circuit discovery algorithms

Towards Automated Circuit Discovery for Mechanistic Interpretability (ACDC)

Finding Transformer Circuits with Edge Pruning

Information Flow Routes: Automatically Interpreting Language Models at Scale

Sparse feature circuits

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs

Anthropic 2025 — biology of an LLM