MANIFESTO · 2026

The microscope is built. Now we need the standards.

The thesis

OpenInterpretability is building the reproducibility and runtime layer for mechanistic interpretability findings in agentic systems — every probe inspectable, every methodology re-runnable, every claim citable.

Anthropic published Persona Vectors and Tracing Thoughts. DeepMind shipped Gemma Scope. Alibaba shipped Qwen-Scope. Neuronpedia built the encyclopedia. Goodfire raised $150M to commercialize the substrate. The interpretability infrastructure is, finally, a thing that exists.

What does not exist yet — at least not in shippable form — is the methodology and product layer above it. The probes that take SAE features and turn them into something a developer can put in front of a customer. The benchmarks that survive Goodhart. The standards that distinguish a probe that learned the underlying signal from a probe that learned a confound. The deployment plumbing that lets a hospital safety team actually use a 27B activation probe in production.

That is the gap OpenInterp fills. We don't train more SAEs — frontier labs already do that better than we ever could. We turn their work into probes that ship and standards that survive Goodhart. Apache 2.0 throughout. Anti-Goodhart by construction.

The five gaps

The gaps that no current tool fills:

Narrative / trace
Features shown in isolation, never the full journey of a prompt. A model's thought is a sequence, not a dictionary entry.
Comparison
"Why did model A answer X but model B answer Y?" is the question that matters in a reasoning-model world. No tool diffs activations side-by-side.
Circuits as UI
Every beautiful circuit figure in the 2024–2026 literature was hand-drawn in Figma. Circuits live in papers, not tools.
Onboarding
UX assumes PhD-level familiarity. Students bounce in 90 seconds. The field grows slower than the problem.
Failure archaeology
"My model hallucinated, which features fired?" Today: write a notebook. Tomorrow: upload a failure dataset, get a ranked feature list back automatically.
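
As a rough illustration of the failure-archaeology workflow, the sketch below ranks features by how much more strongly they fire on failure prompts than on control prompts. It is a minimal sketch under stated assumptions, not OpenInterp's implementation: the function name rank_failure_features, the input layout (one activation vector per prompt), the standardized mean-gap score, and the toy numbers are all hypothetical.

```python
from statistics import mean, pstdev


def rank_failure_features(failure_acts, control_acts, top_k=3):
    """Rank features by the standardized gap in mean activation.

    Both arguments are lists of per-prompt activation vectors
    (one float per SAE feature). Hypothetical layout for illustration.
    """
    n_features = len(failure_acts[0])
    scores = []
    for j in range(n_features):
        fail = [row[j] for row in failure_acts]
        ctrl = [row[j] for row in control_acts]
        # Pooled spread; fall back to 1.0 so constant features do not divide by zero.
        spread = pstdev(fail + ctrl) or 1.0
        scores.append((j, (mean(fail) - mean(ctrl)) / spread))
    # Highest score first: features that fire more on failures than on controls.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Toy data: feature 1 is active on the failure prompts and quiet on the controls.
failure_acts = [[0.0, 2.0, 0.1], [0.0, 3.0, 0.0], [0.1, 2.5, 0.2]]
control_acts = [[0.0, 0.1, 0.1], [0.1, 0.0, 0.0], [0.0, 0.2, 0.2]]
ranking = rank_failure_features(failure_acts, control_acts)
print(ranking)
```

A ranking like this is only a starting point: a feature that separates failures from controls may still track a confound in how the two sets were collected.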

The four pillars

OpenInterp is built on four complementary pillars, the minimum set that fills all five gaps and also finances itself:

Observatory is the microscope — watch the model. Laboratory is the workbench — edit the model. Watchtower is the gantry — monitor the model. Academy is the school — teach the world to use the other three. One platform. Four ways in.

Three structural bets

01

Cross-model feature graph

A feature-equivalence graph across Qwen, Gemma, Llama, Claude, Mistral. "Feature 2503 in Qwen ≈ feature 8901 in Gemma" — rendered, searchable, citable. Grows with every community-submitted SAE; the dataset gets more useful the more people contribute.

Q2 2026: first rendering with 2 models. Q4: 5+ models.
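
One plausible way to propose edges for such a graph is to run both models on the same prompts and link features whose activations correlate strongly. The sketch below assumes exactly that; it is not a description of how OpenInterp builds its graph. The function equivalence_edges, the 0.9 threshold, and the toy activations are hypothetical, and the feature ids are borrowed from the example above purely as labels.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def equivalence_edges(acts_a, acts_b, threshold=0.9):
    """Propose cross-model edges between features with correlated activations.

    acts_a and acts_b map feature id -> activations over the SAME ordered
    prompt set. The threshold is an illustrative assumption.
    """
    edges = []
    for fa, xa in acts_a.items():
        for fb, xb in acts_b.items():
            r = pearson(xa, xb)
            if r >= threshold:
                edges.append((fa, fb, round(r, 3)))
    # Strongest candidate equivalences first.
    return sorted(edges, key=lambda e: e[2], reverse=True)


# Toy activations over four shared prompts.
qwen = {2503: [0.0, 1.0, 0.0, 2.0], 17: [1.0, 0.0, 1.0, 0.0]}
gemma = {8901: [0.1, 2.1, 0.0, 4.0], 42: [0.3, 0.3, 0.9, 0.1]}
edges = equivalence_edges(qwen, gemma)
print(edges)
```

Activation correlation is a candidate generator rather than proof of equivalence; two features can co-fire on a prompt set while doing different causal work.
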
02

Revenue that funds the free tier

A paid monitoring API (Watchtower) is designed to subsidize the OSS tier long-term — so students, researchers, and contributors never hit a paywall. Paid where it sustains (safety teams, compliance, vendor integration); free where it matters.

Target: first design partner Q3, first revenue Q4. Not yet proven.
03

Model partnerships

Working with model vendors and research labs to ship SAEs alongside model releases. When an open-source SAE lands on the same day as the model, interpretability becomes part of the release process rather than an afterthought — a better default for everyone.

First partnership conversation active. Nothing signed yet.

What we uniquely bring

Every claim below is grounded in a shipped, Apache-2.0 licensed, public artifact:

  • First public SAE on the Qwen3.6 family: dense (27B) and triple-hybrid MoE (35B-A3B). Verified against HuggingFace at shipping time; zero competitors.
  • Hybrid architecture expertise: first SAEs on Gated Delta Networks (Qwen3.5-4B), ensemble MoE (Gemma-4 E4B), and triple-hybrid (Qwen3.6-35B-A3B). Landscape was previously uninterpretable.
  • mechreward (features as RL rewards): +19 pp on GSM8K (Qwen3.5-4B) in 168 effective training steps via per-token SAE-sparse rewards. ρ=0.52 cross-architecture on SuperGPQA. pip install mechreward.
  • Stage Gate protocol: correlation pre-test (G1) → three-way ablation (G2) → ceiling-breaking full RL (G3). Don't burn GPU hours until the signal predicts the outcome.
  • Honest negatives: we published the feature-circuits result that failed replication. Trust comes from admitting what broke.
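
The gating idea in the Stage Gate protocol can be sketched as a short pipeline that refuses to start the expensive stage until the cheaper ones pass. This is a minimal sketch under stated assumptions and does not use the mechreward API: run_stage_gates, the Spearman rank correlation as the G1 statistic, the 0.3 threshold, and the caller-supplied G2 and G3 callables are all hypothetical stand-ins.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def run_stage_gates(feature_scores, outcomes, g2_ablation, g3_full_rl, g1_min_rho=0.3):
    """Run G1 -> G2 -> G3, stopping at the first gate that fails."""
    rho = spearman(feature_scores, outcomes)  # G1: does the signal predict the outcome?
    if rho < g1_min_rho:
        return {"stopped_at": "G1", "rho": rho}
    if not g2_ablation():  # G2: caller-supplied ablation check returning True or False
        return {"stopped_at": "G2", "rho": rho}
    return {"stopped_at": None, "rho": rho, "g3_result": g3_full_rl()}  # G3: full RL


calls = []  # records whether the expensive G3 stage actually ran


def fake_full_rl():
    calls.append("rl")
    return "trained"


outcomes = [0, 1, 0, 1]  # toy per-example correctness labels
passed = run_stage_gates([0.1, 0.4, 0.35, 0.8], outcomes, lambda: True, fake_full_rl)
blocked = run_stage_gates([0.8, 0.1, 0.4, 0.35], outcomes, lambda: True, fake_full_rl)
print(passed["stopped_at"], blocked["stopped_at"], calls)
```

In the toy run, the second signal is anti-correlated with the outcomes, so the pipeline stops at G1 and the stand-in RL function is called only once, for the first signal.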

The first-minute experience

A student in Mumbai, on a phone, at 2:00 am, discovers in two minutes a hallucination feature in GPT-5 that nobody has seen. Publishes a mini-paper embedded in the platform. Has three DeepMind researchers commenting in real time before breakfast.

That is the north star. Everything — the hero animation, the mobile-first layout, the zero-login Trace Theater, the shareable trace URLs, the Expedition that validates your work in 15 minutes instead of 15 weeks — is optimized toward that one scene.

Neuronpedia is a tab you consult. OpenInterp is a tab you leave open.

How to get involved

Manifesto last revised 2026-04-23. Build in public. Amend in public.