MANIFESTO · 2026-06

The microscope is built. Now we need the standards.

The thesis

OpenInterpretability is building the audit layer for mechanistic-interpretability claims in agentic systems — the rigor that tells a real signal from a confound. One question, asked honestly: when should we believe a mech-interp claim?

Anthropic published Persona Vectors and Tracing Thoughts. DeepMind shipped Gemma Scope. Alibaba shipped Qwen-Scope. Neuronpedia built the encyclopedia. Goodfire raised $150M to commercialize the substrate. The interpretability microscope is, finally, a thing that exists.

What does not exist yet is the standard above it: the methodology that distinguishes a probe that learned the underlying signal from a probe that learned a confound. That gap is not academic. In our own work a covert-intent probe scored AUROC 0.98 — and collapsed to 0.52 the moment we held the framing constant and cross-validated by trajectory. The number was real. The signal was not. Almost every “our interpretability tool detects X” claim has a 0.98 hiding a 0.52, and the field has no shared, runnable way to catch it.
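A minimal sketch of that failure mode, on synthetic data rather than our activations. Every name here (make_rows, nn_score, cv_auroc) is hypothetical, and the nearest-neighbour probe is a stand-in chosen because it memorizes easily. Each synthetic trajectory carries its own fingerprint, and the label is constant within a trajectory but unrelated to the fingerprint, so there is no real signal to find. Splitting folds by row lets the probe recognize the trajectory; holding out whole trajectories removes that shortcut.

```python
import random

def auroc(scores, labels):
    """Rank-based AUROC: P(positive score > negative score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_rows(n_traj=100, steps=5, dim=8, seed=0):
    """Synthetic activations: each trajectory has its own fingerprint, and the
    label is constant per trajectory but unrelated to that fingerprint."""
    rng = random.Random(seed)
    rows = []  # (trajectory_id, features, label)
    for t in range(n_traj):
        center = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(steps):
            rows.append((t, [c + rng.gauss(0, 0.1) for c in center], t % 2))
    return rows

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nn_score(x, train):
    """Nearest-neighbour probe: higher means closer to a positive example."""
    d_pos = min(sq_dist(x, xt) for _, xt, y in train if y == 1)
    d_neg = min(sq_dist(x, xt) for _, xt, y in train if y == 0)
    return d_neg - d_pos

def cv_auroc(rows, fold_of, n_folds=5):
    """Pooled out-of-fold AUROC; fold_of(index, row) assigns each row a fold."""
    folds = [fold_of(i, r) for i, r in enumerate(rows)]
    scores, labels = [], []
    for f in range(n_folds):
        train = [r for r, g in zip(rows, folds) if g != f]
        test = [r for r, g in zip(rows, folds) if g == f]
        for _, x, y in test:
            scores.append(nn_score(x, train))
            labels.append(y)
    return auroc(scores, labels)

rows = make_rows()
row_auc = cv_auroc(rows, lambda i, r: i % 5)      # folds split rows, so trajectories leak
traj_auc = cv_auroc(rows, lambda i, r: r[0] % 5)  # folds hold out whole trajectories
print(f"row-level CV AUROC:        {row_auc:.2f}")
print(f"trajectory-level CV AUROC: {traj_auc:.2f}")
```

On this toy data the row-level number should land near the ceiling and the trajectory-level number near chance. Only the second is an honest estimate of what the probe knows about the label.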

That is the gap OpenInterp fills. We don’t train more SAEs — frontier labs already do that better than we ever could. We build the layer that decides whether to believe the microscope: an agent-trajectory capture pipeline, a confound auditor, and a 15-paper arc of pre-registered findings — every probe inspectable, every methodology re-runnable, every claim citable. Apache-2.0 throughout. Anti-Goodhart by construction.

What the arc found

We studied one hard problem deeply: why capable LLM agents loop forever and never finish, and whether their internals can tell us — or change it. Eleven beats, each a permanent Zenodo DOI, each an honest one-liner:

01 WANDERING
02 It is finalization, not competence
03 Detect ≠ control
04 The lever is late
05 It generalizes — and it brakes
06 The authorization direction
07 Felt, not granted
08 The late channel
09 Located, not secured
10 The criterion cannot see what it does not measure
11 The action lags the answer

The arc bends one way. Interpretability locates a real, causal control surface — a late action band where an agent commits — but it does not secure it. Detection is not control, even at the exact named feature. An internal authorization monitor reads the authorization the model feels, not the one the user granted. The brake that suppresses an irreversible action collapses under an adaptive white-box attack. Five orthogonal limits, one conclusion:

Use interpretability to audit and monitor a fixed model — not to defend against an adversary optimizing against a known locus, and not as a shortcut to building a better model. Locating where behavior is decided is necessary, and nowhere near sufficient, for securing it.

That is an unfashionable position, and it is the one the evidence supports. The honest frontier read in 2026: interpretability is strong for understanding and auditing, and weak for building and defending — the genuine engineering wins attributed to “interp” come from observable attention/activation structure plus a cheap optimization step, not from circuits or SAE features doing the work. We say so out loud. The lab that tells you when not to believe a result is the one worth believing when it says you can.

The flagship: `oilab audit`

The thesis, made runnable. oilab audit takes a probe or direction and a labeled activation set, and runs a confound battery offline, on CPU, in seconds — permutation null, random-direction floor, leave-one-rollout-out and leave-one-group-out, a structure-matched control, leakage residualization, distribution-shift transfer. It returns one verdict card: REAL_SIGNAL only if the honest cross-validated number survives every applicable check; otherwise CONFOUNDED or UNDETERMINED. It is the codified version of the discipline that turned our own 0.98 into a 0.52 before we could publish it. The rigor a frontier lab applies by hand, as one command anyone can run.
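To make two of those checks concrete, here is a minimal sketch of a permutation null and a random-direction floor on toy data. It is not the oilab implementation or its API: the function names, the thresholds (alpha, the 0.95 quantile, min_per_class) and the rule for combining checks into a verdict are all assumptions made for illustration, and the grouped cross-validation, structure-matched control, leakage residualization and transfer checks are left out.

```python
import random

def auroc(scores, labels):
    """Rank-based AUROC: P(positive score > negative score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def project(acts, direction):
    """Dot product of every activation vector with the candidate direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in acts]

def permutation_p(scores, labels, rng, n_perm=200):
    """Share of label shuffles whose AUROC matches or beats the observed one."""
    observed = auroc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += auroc(scores, shuffled) >= observed
    return (1 + hits) / (1 + n_perm)

def random_direction_floor(acts, labels, rng, n_dirs=200, q=0.95):
    """q-quantile of sign-agnostic AUROC over random Gaussian directions."""
    dim = len(acts[0])
    vals = []
    for _ in range(n_dirs):
        d = [rng.gauss(0, 1) for _ in range(dim)]
        a = auroc(project(acts, d), labels)
        vals.append(max(a, 1 - a))
    vals.sort()
    return vals[int(q * (n_dirs - 1))]

def audit(acts, labels, direction, seed=0, alpha=0.05, min_per_class=10):
    """Toy verdict card; the combination rule is an assumption, not oilab's."""
    n_pos = sum(labels)
    if min(n_pos, len(labels) - n_pos) < min_per_class:
        return {"verdict": "UNDETERMINED"}  # too little data to run the checks
    rng = random.Random(seed)
    scores = project(acts, direction)
    card = {
        "auroc": auroc(scores, labels),
        "perm_p": permutation_p(scores, labels, rng),
        "random_floor": random_direction_floor(acts, labels, rng),
    }
    passed = card["perm_p"] < alpha and card["auroc"] > card["random_floor"]
    card["verdict"] = "REAL_SIGNAL" if passed else "CONFOUNDED"
    return card

# Toy activations: 16 dimensions, label information only along dimension 0.
data_rng = random.Random(1)
labels = [i % 2 for i in range(120)]
acts = [[data_rng.gauss(0, 1) + (3.0 * y if j == 0 else 0.0) for j in range(16)]
        for y in labels]

signal_card = audit(acts, labels, [1.0] + [0.0] * 15)      # aligned with the signal
noise_card = audit(acts, labels, [0.0, 1.0] + [0.0] * 14)  # orthogonal to it
print(signal_card)
print(noise_card)
```

The random-direction floor matters because a direction picked at random still overlaps the true signal a little. A probe only earns credit for the margin above what an arbitrary direction in the same space achieves.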

Why the claims are trustworthy

The discipline, not the marketing:

We publish our own nulls.

Six pre-registered walk-backs across the arc. A negative result, reported as a negative result, is the unit of progress.

Every number is recomputed.

Each paper ships an eval script that recomputes every figure from the public ledgers (35/35, 54/54, 88/88) plus a web-verified citation check.

Permanent + reproducible.

Zenodo DOIs, public GitHub, Hugging Face datasets, and one-command replication via openinterp-lab on the Colab CLI.

Depth over breadth.

One open-weights reasoning model (Qwen3.6-27B) studied deeply, with cross-architecture checks (gpt-oss-20b, Llama-3.1) where the claim is universal.
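
The “every number is recomputed” item above can be sketched as a small harness. This is an illustration under stated assumptions, not one of the shipped eval scripts: the ledger format (one JSON object per line), the field names, and the claimed values are all made up for the example. The output is a pass count in the same k/k shape as the tallies quoted above.

```python
import json
import math

# Hypothetical ledger: one JSON object per line (field names are illustrative).
LEDGER = """\
{"run": "r1", "label": 1, "score": 0.9}
{"run": "r2", "label": 1, "score": 0.8}
{"run": "r3", "label": 0, "score": 0.3}
{"run": "r4", "label": 0, "score": 0.2}
"""

# Numbers as they would be quoted in a paper's text (made up for this sketch).
CLAIMED = {"n_runs": 4, "positive_rate": 0.5, "mean_score": 0.55}

# One recomputation rule per claimed number, applied to the raw ledger rows.
RECOMPUTE = {
    "n_runs": lambda rows: len(rows),
    "positive_rate": lambda rows: sum(r["label"] for r in rows) / len(rows),
    "mean_score": lambda rows: sum(r["score"] for r in rows) / len(rows),
}

def load(text):
    """Parse a JSON-lines ledger, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def verify(claimed, rows, tol=1e-9):
    """Map each claimed number to True if the ledger reproduces it within tol."""
    return {
        name: math.isclose(RECOMPUTE[name](rows), value, abs_tol=tol)
        for name, value in claimed.items()
    }

results = verify(CLAIMED, load(LEDGER))
passed, total = sum(results.values()), len(results)
print(f"{passed}/{total} claimed numbers recomputed from the ledger")
```

Keeping the claimed values and the recomputation rules in separate tables means a number edited in the text without a matching change in the data shows up as a failed check instead of passing silently.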

What we uniquely bring

Every claim below is grounded in a shipped, Apache-2.0 licensed, public artifact:

Real agent-trajectory activation capture — per-token internals captured on real long-horizon coding-agent runs (SWE-bench Pro) and on a ported SHADE-Arena. Almost nobody combines SAE/circuit interpretability with activation capture on real agent trajectories — that is where the arc’s findings come from.
First public SAE on the Qwen3.6 family — full-stack, 11 layers on the 27B reasoning model; plus first SAEs on hybrid architectures (Gated Delta Networks, triple-hybrid MoE). Verified against Hugging Face at shipping time.
oilab audit — the confound auditor — the only runnable, offline standard for telling a real probe signal from a confound. Reproduces our own airtight falsification, byte-for-byte, on CPU.
Honest negatives — six pre-registered walk-backs across the arc, including the result that detection is not control and the white-box monitoring advantage we tried to prove and could not. Trust comes from admitting what broke.

A second research line — full-stack SAE training, mechanistic reward modeling, and sub-4-bit quantization for cheap open-weights inference — funds and feeds the agent-safety arc above.

The first-minute experience

A researcher with a probe that scores 0.9 runs oilab audit on a laptop, in one command, and learns, before they believe it, before they publish it, before they ship it, that the signal is a confound. Or that it is real, and now provably so.

That is the north star. Everything — the replicable arc, the agent-capture pipeline, the confound battery, the recompute-every-number harness — is optimized toward turning “the model probably does X” into “here is the check, run it yourself.”

Frontier labs build the microscope. We build the standard that tells a finding from a confound.