Research-grade tools for alignment and interpretability

Mechanistic interpretability, operational.

The first open stack for training sparse autoencoders on hybrid architectures and using their features as per-token reward signals in reinforcement learning.

$ pip install mechreward
GSM8K lift (Qwen3.5-4B)
+19 pp
64% → 83% in 168 effective steps
G1 correlation (Qwen3.6-35B)
ρ = 0.52
n=100 held-out SuperGPQA, p<1e-7
Sparse vs raw ablation
+11 pp
R1 (SAE-sparse) − R2 (raw direction)
Public SAEs on hybrid arch
3
GDN · ensemble-MoE · triple-hybrid

Six things that don't exist anywhere else.

Every card below corresponds to a public artifact: a trained SAE, a validated feature pack, a protocol, or an ablation result. No vaporware.

01

Hybrid-architecture SAEs

First public TopK residual-stream SAEs on Gated DeltaNet, ensemble MoE, and triple-hybrid MoE+GDN+Gated-Attn. No one else has released these.

3 models, 2048–2560 d_model, 16× expansion, 200M–1B training tokens
02

Validated feature packs

Stage Gate 1 correlation ρ=0.52–0.54 on held-out GSM8K / SuperGPQA. Features predict answer correctness across architectures.

ρ verified on n ≥ 100 held-out examples per model
03

mechreward library

Per-token SAE feature activations as dense reward inside GRPO. Qwen3.5-4B → +19 pp on GSM8K in 168 effective training steps.

pip install mechreward
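The per-token reward above boils down to a contrastive sum over a validated feature pack (the 10 helpful + 10 harmful feature IDs shipped in each pack). A minimal sketch — function and argument names are illustrative, not mechreward's exact API:

```python
import numpy as np

def contrastive_token_reward(sae_acts, helpful_ids, harmful_ids, scale=1.0):
    """Per-token reward from SAE feature activations.

    sae_acts: (seq_len, d_sae) array of TopK SAE activations for one completion.
    Returns a (seq_len,) reward: helpful-feature mass minus harmful-feature mass.
    """
    helpful = sae_acts[:, helpful_ids].sum(axis=1)
    harmful = sae_acts[:, harmful_ids].sum(axis=1)
    return scale * (helpful - harmful)

# toy check: a helpful feature fires on every token, a harmful one on token 2
acts = np.zeros((4, 64))
acts[:, 3] = 1.0
acts[2, 7] = 2.0
r = contrastive_token_reward(acts, helpful_ids=[3], harmful_ids=[7])
```

Because the reward is dense (one scalar per token), it plugs into GRPO's advantage computation without any credit-assignment machinery.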
04

Cross-architecture evidence

Same protocol, same contrastive reward formula, run on 4B dense-GDN, 9B ensemble-MoE, and 35B-A3B triple-hybrid. The thesis transfers.

Stage Gate protocol: G1 → G2 → G3
05

Sparse vs raw-direction ablation

Our G2 ablation (R1 SAE-sparse vs R2 raw-direction) shows an 11 pp gap on GSM8K. Sparse decomposition is causal, not cosmetic.

Direct empirical argument vs linear-probe baselines
06

Open catalog + protocol

Every SAE, every reward pack, every evaluation result is public. No black boxes. Stage Gates are reproducible step by step.

Apache-2.0, all artifacts on GitHub + HF Hub

Trained SAEs

First TopK residual-stream SAEs on architectures previously unreachable.

View all

Qwen/Qwen3.5-4B

Released

Hybrid Gated DeltaNet · Residual post-L18 · 16× expansion · 200M training tokens

First TopK residual-stream SAE for hybrid GDN

var_exp
0.866
d_sae
40,960
G1 ρ
0.540
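The TopK architecture and the var_exp metric in these cards can be sketched in a few lines of NumPy. This is an illustrative forward pass under the cards' configuration (16× expansion, k active features per token), not the training code:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """TopK SAE: keep only each token's k largest pre-activations."""
    pre = x @ W_enc + b_enc                                  # (n, d_sae)
    thresh = np.partition(pre, -k, axis=1)[:, -k][:, None]   # k-th largest per row
    acts = np.where(pre >= thresh, np.maximum(pre, 0.0), 0.0)
    recon = acts @ W_dec + b_dec
    return acts, recon

def variance_explained(x, recon):
    """var_exp as reported above: 1 - residual variance / total variance."""
    return 1.0 - np.var(x - recon) / np.var(x)

# toy check: random weights, at most k = 8 active features per token
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W_enc = rng.normal(size=(16, 64)) * 0.1
b_enc = np.zeros(64)
W_dec = rng.normal(size=(64, 16)) * 0.1
b_dec = np.zeros(16)
acts, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=8)
```

TopK makes sparsity a hard architectural constraint rather than an L1 penalty, which is why d_sae-wide activations stay interpretable as a handful of named features per token.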

Google/Gemma-4-E4B

Released

Ensemble MoE · Residual post-L21 · 16× expansion · 1B training tokens

First public SAE for Gemma-4 ensemble-MoE

var_exp
0.939
d_sae
32,768
G1 ρ

Qwen/Qwen3.6-35B-A3B

Training in progress

Triple-hybrid (MoE + GDN + Gated Attention) · Residual post-L23 · 16× expansion · 92M (WIP) training tokens

First public SAE on triple-hybrid MoE+GDN+Gated-Attention. No precedent in the literature.

var_exp
0.835
d_sae
32,768
G1 ρ
0.522

Where we fit.

Honest comparison against the four closest prior works. All numbers are from the published papers; we don't soften or spin.

RLFR (Goodfire)

arxiv:2602.10067

Linear probes on activations → online RL reward

Their result: 58% hallucination reduction on Gemma-3-12B-IT

How we differ: We use sparse TopK SAE features instead of raw probes; the 11 pp R1-vs-R2 gap in our G2 is the empirical argument for why decomposition matters.

CRL (Holistic AI)

arxiv:2602.10437

PPO with SAE features as action space (select which feature to amplify)

Their result: +1.03 pp on GSM8K with Gemma-2-2B

How we differ: We use SAE features as the reward signal itself, not as an action space. +19 pp on Qwen3.5-4B GSM8K. Methods are complementary (different axes of using SAE features in RL).

SAE features → linear head → frozen reward model for offline RLHF

Their result: Preference-model quality improvements

How we differ: We are online, per-token, and target reasoning on hybrid architectures — not preference modeling on dense transformers.

AIRI ReasonScore

arxiv:2503.18878

SAE feature amplification at inference (contrastive around reasoning vocabulary)

Their result: +13.4% AIME-2024 on DeepSeek-R1-Distill-Llama-8B

How we differ: Inference-time intervention vs training-time reward. We ported ReasonScore into our library for completeness and ran it on Qwen3.5-4B, confirming that it surfaces rhetoric features, not correctness features.

The Stage Gate protocol.

Don't spend GPU hours on RL until you've verified the signal predicts the outcome. Every validated pack in the catalog has passed all three gates.

G1

correlation pre-test

Verify features predict outcome on held-out data before spending GPU hours on RL.

Threshold
ρ ≥ 0.30
Budget
~$5, ~30 min
Artifacts
reasoning_pack.json (10 helpful + 10 harmful feature IDs)
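The G1 gate is nothing more exotic than a Spearman correlation between a per-sample feature score and answer correctness, checked against the ρ ≥ 0.30 threshold. A minimal sketch — the scoring function and names are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def g1_pretest(feature_scores, correct, rho_threshold=0.30):
    """G1 correlation pre-test.

    feature_scores: one scalar per held-out sample (e.g. mean contrastive
    activation over the response). correct: 0/1 answer correctness.
    Returns (rho, p-value, passed).
    """
    rho, p = spearmanr(feature_scores, correct)
    return rho, p, bool(rho >= rho_threshold)

# synthetic held-out set: a score monotonically related to correctness
scores = np.arange(100, dtype=float)
correct = (scores >= 50).astype(int)
rho, p, passed = g1_pretest(scores, correct)
```

If the pre-test fails, you stop before G2 — that is the whole point of gating the GPU spend.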
G2

three-way reward ablation

Compare outcome-only (R0) vs outcome + SAE-sparse (R1) vs outcome + raw-direction (R2).

Threshold
R1 ≥ R0 + 2 pp AND R1 − R2 ≥ 5 pp
Budget
~$15, ~100 steps
Artifacts
R0/R1/R2 LoRA adapters + comparison table
G3

full RL, ceiling-breaking

Scale-up with per-token mech-reward, MMLU preservation check, adversarial canary suite.

Threshold
R1 ≥ 80% of target benchmark, hack_rate < 30%, MMLU regression < 2 pp
Budget
~$60, ~400 steps
Artifacts
Published LoRA adapter + full eval table + LW writeup

Validated benchmarks.

Qwen3.5-4B · Stage Gate 3 Phase A (GSM8K)

Baseline
64%
Trained (R1)
83%
Δ
+19 pp

Per-token SAE-feature reward lifts Qwen3.5-4B from 64% to 83% on GSM8K in 168 effective training steps, +7 pp above the same-SAE trajectory-level G2 R1 ceiling (76%). MMLU did not regress; hack rate stayed within the baseline 95% CI.

Qwen3.6-35B-A3B · Stage Gate 1 (SuperGPQA)

Spearman ρ
0.522
Pearson r
0.537
n
100

First cross-architecture validation. The SAE, trained on just 92M tokens (46% of the Qwen3.5-4B budget), already reaches the same correlation level as Qwen3.5-4B (ρ = 0.522 vs 0.540). The signal transfers to the triple-hybrid MoE.

Use SAE features as RL reward — in ten lines.

mechreward drops into TRL, OpenRLHF, and verl with a single import. Every feature pack is validated at ρ ≥ 0.30 on held-out data before it ships.
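mechreward's own API is not reproduced here; as an illustration of the shape such a hook takes, here is a reward callable in the convention TRL's GRPOTrainer expects for reward_funcs (completions in, list of floats out). The encoder argument is a stand-in for whatever SAE you load:

```python
import numpy as np

def make_pack_reward(sae_encode, helpful_ids, harmful_ids):
    """Build a sequence-level reward callable from a feature pack.

    sae_encode: maps a completion string to its (seq_len, d_sae) SAE
    activations -- a stand-in here, not a real loader.
    """
    def reward_fn(completions, **kwargs):
        rewards = []
        for completion in completions:
            acts = sae_encode(completion)
            rewards.append(float(acts[:, helpful_ids].sum()
                                 - acts[:, harmful_ids].sum()))
        return rewards
    return reward_fn

# toy encoder stand-in: every feature fires at 1.0 over a 3-token completion
def toy_encode(completion):
    return np.ones((3, 8))

reward_fn = make_pack_reward(toy_encode, helpful_ids=[0, 1], harmful_ids=[2])
rewards = reward_fn(["solution A", "solution B"])
```

Pass the callable to your trainer's reward hook and the validated pack does the rest.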