We extend frontier-lab interpretability. We do not replicate it.
OpenInterp is the methodology and product layer above the SAE infrastructure that Anthropic, DeepMind, Alibaba, and others already ship. Our job is to turn their research into probes that ship and standards that survive Goodhart. Apache 2.0 throughout. Anti-Goodhart by construction.
Industrial-scale SAE suite for the Gemma family — Top-K + AuxK SAEs at every layer of every Gemma model, with feature labels.
How OpenInterp builds on it
Cross-model probe transfer (planned Q4 2026): apply FabricationGuard methodology to Gemma 2B/9B using their SAE features. Compare AUROC across architectures.
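The planned comparison reduces to a simple loop: fit a probe on one model's SAE features, score held-out examples, report AUROC, and repeat per architecture. A minimal numpy sketch, with random stand-in data in place of Gemma Scope activations and a least-squares probe standing in for whatever probe family FabricationGuard actually uses:

```python
import numpy as np

def auroc(scores, labels):
    """Mann-Whitney AUROC: probability a positive example outranks a negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def fit_probe(X, y):
    """Least-squares linear probe on (centered) SAE feature activations."""
    w, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)
    return w

rng = np.random.default_rng(0)
# Stand-in activations: the real run would use one feature matrix per
# architecture (e.g. Gemma 2B vs 9B SAE features on the same labeled set).
X = rng.normal(size=(500, 64))
y = rng.integers(0, 2, size=500)
X[y == 1, :8] += 0.8                          # plant a weak detectable signal
w = fit_probe(X[:400], y[:400])
test_scores = (X[400:] - X[:400].mean(axis=0)) @ w
test_auc = auroc(test_scores, y[400:])        # held-out AUROC for one "model"
```

Running this once per architecture and comparing the resulting AUROC values is the whole transfer experiment in miniature.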
Official SAE suite for the Qwen family — 14 SAEs covering Qwen3 (1.7B/8B/30B-A3B base) and Qwen3.5 (2B/9B/27B/35B-A3B base). Dictionary widths from 32K to 128K (W32K–W128K), L0 sparsity targets of 50 and 100 (L0_50/L0_100).
How OpenInterp builds on it
Complementary coverage: Qwen-Scope ships base/MoE SAEs; OpenInterp ships SAEs for the post-trained reasoning variants (Qwen3.6-27B, Qwen3.6-35B-A3B). Q4 2026: ProbeBench will register Qwen-Scope SAEs as upstream substrates so external probes can target their features.
First public SAEs for the Qwen3.6 reasoning-tuned series. Top-K + AuxK SAEs at layers 11, 31, and 55; 65k-feature dictionaries; k=128 sparsity; trained on 200M tokens. Validated via the InterpScore composite metric.
How OpenInterp builds on it
Substrate for FabricationGuard, ReasonGuard, multi-probe DPO, multi-probe GRPO. Where Qwen-Scope stops (base / MoE base), we cover (reasoning).
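For readers unfamiliar with the architecture named above, here is a toy numpy sketch of a Top-K SAE forward pass. It is illustrative only: dimensions are shrunk from the shipped n=65k / k=128, and the AuxK auxiliary loss is training-time machinery not shown here.

```python
import numpy as np

def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k=128):
    """One Top-K SAE forward pass: ReLU-encode, zero everything outside the
    k largest activations, then linearly decode. The AuxK term that revives
    dead features during training is omitted (inference-time view only)."""
    acts = np.maximum(x @ W_enc + b_enc, 0.0)
    if k < acts.size:
        thresh = np.partition(acts, -k)[-k]         # value of the k-th largest
        acts = np.where(acts >= thresh, acts, 0.0)  # ties may keep a few extra
    return acts, acts @ W_dec + b_dec

rng = np.random.default_rng(0)
d_model, n_feat, k = 64, 4096, 128                  # toy sizes; shipped: n=65k, k=128
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_feat, d_model)) / np.sqrt(n_feat)
z, x_hat = topk_sae(rng.normal(size=d_model), W_enc,
                    np.zeros(n_feat), W_dec, np.zeros(d_model), k=k)
```

The sparse code `z` (at most k nonzero entries, barring exact ties) is the substrate the probes read; `x_hat` is the reconstruction of the residual-stream activation.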
Methodology · Goodfire · 2025
RLFR (Reinforcement Learning from Feature Rewards)
Methodology for using SAE features as reward signals in RLHF-style pipelines. Demonstrated on Llama 3.
How OpenInterp builds on it
Inspired our multi-probe DPO and multi-probe GRPO pipelines. We extend the approach by using probes (not steering vectors) and orthogonal-axis rewards (FabricationGuard ⊥ ReasonGuard, Pearson r = +0.014).
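The orthogonality claim is cheap to audit: compute the Pearson correlation between the two probes' scores on a shared set of rollouts. A sketch with synthetic stand-in scores (the real check would use FabricationGuard and ReasonGuard outputs on the same rollouts):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two per-rollout reward signals."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

rng = np.random.default_rng(0)
# Stand-in scores: independent draws, so the correlation should sit near 0,
# mimicking the reported FG/RG orthogonality.
fg = rng.normal(size=10_000)
rg = rng.normal(size=10_000)
r = pearson(fg, rg)                 # near zero for genuinely orthogonal axes
reward = 0.5 * fg + 0.5 * rg        # a combined multi-probe reward term
```

Near-zero correlation is what makes the combined reward meaningful: pushing one axis up cannot silently pay for itself by Goodharting the other.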
vLLM plugin for residual stream extraction at inference time. Open-source infrastructure.
How OpenInterp builds on it
OpenInterp vLLM plugin (planned Q3 2026) builds on top: applies probes at inference time using extracted activations. UK AISI ships infra; we ship the application layer.
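The application layer is conceptually small: given per-token residual-stream activations from the extraction plugin, apply an offline-trained linear probe and flag tokens that cross a threshold. A hedged numpy sketch — the plugin's actual API is not shown, and `apply_probe` plus all shapes here are illustrative:

```python
import numpy as np

def apply_probe(resid_acts, w, b, threshold=0.5):
    """Score per-token residual-stream activations with a linear probe.
    `resid_acts` stands in for whatever the extraction plugin hands back;
    `w`, `b` are the parameters of a probe trained offline."""
    probs = 1.0 / (1.0 + np.exp(-(resid_acts @ w + b)))   # sigmoid
    return probs, probs > threshold

rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 512))        # 32 tokens, d_model = 512 (toy)
w, b = rng.normal(size=512) * 0.05, 0.0
probs, flags = apply_probe(acts, w, b)   # per-token scores + boolean flags
```

The point of the split stands regardless of API details: extraction is the hard systems problem; scoring is a matrix-vector product per token.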
What this means for us
→ We do not train more SAEs unless there's a gap. Qwen-Scope covers base/MoE; we cover reasoning-tuned. Gemma Scope covers Gemma; we have not duplicated that work.
→ We turn closed-source methodology into shippable probes. Anthropic describes Persona Vectors in a paper; FabricationGuard puts the method on PyPI with cross-task validation.
→ We register honest negatives. ReasonGuard's narrow-scope cross-domain limitation is on ProbeBench. CoTGuard v1's methodology failure is documented publicly. We do not spin numbers.
→ We build the standards layer. ProbeBench is to probes what SAEBench is to SAEs — but with anti-Goodhart norms baked in (random-K, fresh-probe AUROC, three-way split, judge audit).
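Two of those anti-Goodhart norms can be sketched directly. The three-way split keeps a holdout that never enters the tuning loop, and the random-K control asks whether a candidate feature set beats probes built on k randomly chosen features, not merely chance. A toy numpy sketch — function names and the trivial `score` function are illustrative, not ProbeBench's API:

```python
import numpy as np

def three_way_split(n, rng, frac=(0.6, 0.2, 0.2)):
    """Disjoint train/val/holdout indices; the holdout is scored once,
    outside the tuning loop."""
    idx = rng.permutation(n)
    a = int(frac[0] * n)
    b = a + int(frac[1] * n)
    return idx[:a], idx[a:b], idx[b:]

def random_k_baseline(X, y, k, n_trials, score_fn, rng):
    """Random-K control: the score distribution of probes built on k
    randomly chosen features. A candidate feature set should beat this
    distribution, not just beat chance."""
    return np.array([
        score_fn(X[:, rng.choice(X.shape[1], size=k, replace=False)], y)
        for _ in range(n_trials)
    ])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))          # stand-in feature activations
y = rng.integers(0, 2, size=300)
train, val, hold = three_way_split(300, rng)
score = lambda Xs, ys: float(abs(np.corrcoef(Xs.mean(axis=1), ys)[0, 1]))
baseline = random_k_baseline(X[train], y[train], k=8,
                             n_trials=25, score_fn=score, rng=rng)
```

A submitted probe's fresh-probe AUROC is then compared against the whole `baseline` distribution on the untouched holdout, which is what makes the leaderboard hard to game.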