BUILT ON · LINEAGE

We extend frontier-lab interpretability.
We do not replicate it.

OpenInterp is the methodology and product layer above the SAE infrastructure that Anthropic, DeepMind, Alibaba, and others already ship. Our job is to turn their research into deployable probes and into standards that survive Goodhart's law. Apache 2.0 throughout. Anti-Goodhart by construction.

Methodology · Anthropic · Aug 2025

Persona Vectors

Methodology for extracting persona-aligned linear directions from activations. Demonstrated internally on Claude 3.5 Sonnet and on models at the 7-8B scale.

How OpenInterp builds on it

FabricationGuard productizes this methodology on the Qwen3.6-27B reasoning model: AUROC 0.88 cross-task on hallucination detection, Apache 2.0 licensed, ~1ms inference.
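
The core of a probe like this is a single dot product against a learned direction in the residual stream, which is what makes ~1ms inference plausible. A minimal sketch (all names, dimensions, and weights here are illustrative stand-ins, not FabricationGuard's actual API):

```python
import numpy as np

# Illustrative residual-stream width; not the real Qwen3.6-27B value.
D_MODEL = 5120

rng = np.random.default_rng(0)
direction = rng.standard_normal(D_MODEL)   # stand-in for a learned probe direction w
direction /= np.linalg.norm(direction)     # unit-normalize
bias = 0.0                                 # stand-in for a learned bias b

def probe_score(activation: np.ndarray) -> float:
    """Score one activation vector: sigmoid(w . x + b).

    One dot product per token position: cheap enough to run
    alongside inference on every generation step.
    """
    logit = float(direction @ activation) + bias
    return 1.0 / (1.0 + np.exp(-logit))

# Usage: score a stand-in activation vector.
x = rng.standard_normal(D_MODEL)
score = probe_score(x)
assert 0.0 <= score <= 1.0
```

The point of the sketch is the cost model: the probe adds one `d_model`-length dot product per scored position, regardless of model size.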

Methodology · Anthropic · Mar 2025

Tracing Thoughts

Internal-state introspection methodology to distinguish faithful from unfaithful reasoning via attribution graphs.

How OpenInterp builds on it

ReasonGuard derives a deployable linear probe at L55/mid_think on Qwen3.6-27B, with an honestly narrow scope (AUROC 0.888 within math, 0.605 cross-domain).

Methodology · Anthropic · Apr 2025

Reasoning Models Don't Always Say What They Think

Hint-injection methodology measuring CoT faithfulness. Found 25-41% strict faithfulness on Claude 3.7 / DeepSeek R1.

How OpenInterp builds on it

CoTGuard v1 attempted hint-injection on Qwen3.6-27B with a Claude Haiku judge and measured a 98.6% positive rate (the judge was too permissive to be informative). CoTGuard v2 will switch to causal-mediation methodology (Lanham et al. 2023 truncation). The negative result is documented publicly.

SAE substrate · DeepMind · 2024

Gemma Scope

Industrial-scale SAE suite for the Gemma family — JumpReLU SAEs on every layer and sublayer of Gemma 2 2B and 9B (plus select layers of 27B), with feature labels.

How OpenInterp builds on it

Cross-model probe transfer (planned Q4 2026): apply FabricationGuard methodology to Gemma 2B/9B using their SAE features. Compare AUROC across architectures.

SAE substrate · Alibaba · Apr 2026

Qwen-Scope

Official SAE suite for the Qwen family — 14 SAEs covering Qwen3 (1.7B/8B/30B-A3B base) and Qwen3.5 (2B/9B/27B/35B-A3B base). W32K-W128K, L0_50/L0_100.

How OpenInterp builds on it

Complementary coverage: Qwen-Scope ships base/MoE SAEs; OpenInterp ships SAEs for the post-trained reasoning variants (Qwen3.6-27B, Qwen3.6-35B-A3B). Q4 2026: ProbeBench will register Qwen-Scope SAEs as upstream substrates so external probes can target their features.

SAE substrate · OpenInterp · Apr 2026

OpenInterp paper-grade SAEs

First public SAEs for the Qwen3.6 reasoning-tuned series. Top-K + AuxK at L11/L31/L55, 65k-feature dictionary, k=128 sparsity, trained on 200M tokens. Validated via the InterpScore composite metric.

How OpenInterp builds on it

Substrate for FabricationGuard, ReasonGuard, multi-probe DPO, and multi-probe GRPO. Where Qwen-Scope stops (base and MoE-base checkpoints), we continue (post-trained reasoning variants).

Methodology · Goodfire · 2025

RLFR (Reinforcement Learning from Feature Rewards)

Methodology for using SAE features as reward signals in RLHF-style pipelines. Demonstrated on Llama 3.

How OpenInterp builds on it

Inspired our multi-probe DPO and multi-probe GRPO pipelines. We extend the approach by using probes rather than steering vectors, and by rewarding along orthogonal axes (FabricationGuard ⊥ ReasonGuard, Pearson r = +0.014).
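
The orthogonality claim is checkable with nothing more than a Pearson correlation between the two probes' per-example scores over a shared evaluation set. A sketch with synthetic stand-in scores (the variable names are ours, not any shipped API):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in per-example scores from two probes over one shared eval set.
fg_scores = rng.standard_normal(10_000)   # e.g. a FabricationGuard-style probe
rg_scores = rng.standard_normal(10_000)   # e.g. a ReasonGuard-style probe

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two score vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r = pearson(fg_scores, rg_scores)
# Independent score streams land near r = 0 — the regime the page
# describes as orthogonal axes (reported Pearson +0.014).
assert abs(r) < 0.05
```

Near-zero correlation is what makes the two probes usable as independent reward terms: optimizing one axis does not mechanically move the other.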

Methodology · Marks et al. (ICLR 2025) · Mar 2024

Sparse Feature Circuits

Methodology for discovering and editing interpretable causal graphs at SAE-feature level using attribution patching.

How OpenInterp builds on it

Foundation for ProbeBench's anti-Goodhart norms (random-K controls, three-way splits). Cited in the paper-1 ICML submission.

SAE substrate · Gao et al. (OpenAI) · 2024

Top-K + AuxK SAE

Top-K SAE architecture with an auxiliary loss (AuxK) that mitigates dead features. Now the standard recipe for SAE training across labs.
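
The defining move of the architecture is the activation function: keep only the k largest encoder pre-activations per example and zero the rest, so sparsity is enforced exactly rather than via an L1 penalty. A minimal forward pass in NumPy (sizes and weights are illustrative; the AuxK training loss over dead features is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL, N_DICT, K = 512, 4096, 32   # illustrative sizes, not paper values

W_enc = rng.standard_normal((D_MODEL, N_DICT)) / np.sqrt(D_MODEL)
W_dec = rng.standard_normal((N_DICT, D_MODEL)) / np.sqrt(N_DICT)
b_enc = np.zeros(N_DICT)
b_dec = np.zeros(D_MODEL)

def topk_sae(x: np.ndarray, k: int = K):
    """Top-K SAE forward pass: encode, keep top-k, decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    # Top-K activation: zero everything except the k largest entries.
    idx = np.argpartition(pre, -k)[-k:]
    codes = np.zeros_like(pre)
    codes[idx] = np.maximum(pre[idx], 0.0)   # ReLU on the kept entries
    recon = codes @ W_dec + b_dec
    return codes, recon

x = rng.standard_normal(D_MODEL)
codes, recon = topk_sae(x)
assert int((codes != 0).sum()) <= K   # sparsity holds by construction
```

During training, AuxK adds an auxiliary reconstruction term computed from otherwise-dead features, which keeps the dictionary from collapsing to a small active subset.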

How OpenInterp builds on it

Used directly in our paper-grade Qwen3.6-27B and Qwen3.6-35B-A3B SAEs.

Tooling / infra · Decode Research · 2024

Neuronpedia

SAE feature encyclopedia for browsing, citing, and exporting features across model families.

How OpenInterp builds on it

OpenInterp Atlas (Q3 2026) integrates with Neuronpedia for cross-model feature search. Symbiotic — they curate, we apply.

Tooling / infra · UK AISI · Mar 2026

vLLM-Lens

vLLM plugin for residual stream extraction at inference time. Open-source infrastructure.

How OpenInterp builds on it

OpenInterp vLLM plugin (planned Q3 2026) builds on top: applies probes at inference time using extracted activations. UK AISI ships infra; we ship the application layer.

What this means for us

  • We do not train more SAEs unless there's a gap. Qwen-Scope covers base/MoE; we cover reasoning-tuned. Gemma Scope covers Gemma; we have not duplicated that work.
  • We turn closed-source methodology into shippable probes. Anthropic describes Persona Vectors in a paper; FabricationGuard puts it on PyPI with cross-task validation.
  • We register honest negatives. ReasonGuard's narrow-scope cross-domain limitation is on ProbeBench. CoTGuard v1's methodology failure is documented publicly. We do not spin numbers.
  • We build the standards layer. ProbeBench is to probes what SAEBench is to SAEs — but with anti-Goodhart norms baked in (random-K, fresh-probe AUROC, three-way split, judge audit).
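
The random-K control in that last bullet has a simple shape: re-run the same evaluation with random directions (or random feature subsets) in place of the learned probe, and require the real probe's AUROC to clear that baseline. A self-contained sketch on synthetic data (the AUROC helper and the planted-signal setup are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUROC via the rank-sum identity (no sklearn dependency)."""
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

# Synthetic eval set: activations with a planted signal direction.
D, N = 256, 4000
signal = rng.standard_normal(D)
signal /= np.linalg.norm(signal)
labels = rng.integers(0, 2, N)
acts = rng.standard_normal((N, D)) + np.outer(labels, signal) * 2.5

# "Real" probe: score along the planted direction.
real_auroc = auroc(acts @ signal, labels)

# Random-K control: average AUROC of random directions on the same data.
controls = []
for _ in range(20):
    d = rng.standard_normal(D)
    controls.append(auroc(acts @ d, labels))
control = float(np.mean(controls))

assert real_auroc > 0.9           # the planted probe clears the bar
assert abs(control - 0.5) < 0.1   # random directions hover near chance
```

A probe that cannot beat its own random-direction baseline is reporting dataset structure, not the concept it claims to detect — which is exactly the Goodhart failure the control is there to catch.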