We've shipped 11 studies on one model. Six of our own claims walked back.
From those walk-backs, a protocol. From 11 studies, a registry. From the registry, schemas the community can use. The reproducibility layer for mechanistic interpretability.
pip install "openinterp-mcp[server]" (v0.1.0, new)
pip install openinterp (v0.3.0, live)
We extend frontier-lab interpretability infrastructure with a methodology + product layer. Apache 2.0 throughout. Anti-Goodhart by construction.
Built from 11 studies on Qwen3.6-27B. Six walked back. From those walk-backs, the protocol that catches it next time.
A six-step decision tree for testing whether a probe, SAE feature, or steering result is causal — or epiphenomenal.
A public record of mech-interp claims that didn't survive their own diagnostics. Six first entries are ours.
JSON schemas for probe cards, causal reports, intervention traces. The format the Registry uses. Apache-2.0.
Four supporting environments around the core probe + standard + deploy stack.
See the model thinking, feature by feature, token by token.
Edit the model. Compose interventions. Export steered checkpoints.
Monitor LLMs in production. Feature-level observability for safety teams.
Onboard the world. From "what is an activation" to "discover a new feature" in 90 minutes.
openinterp-mcp is an MCP server that lets any agent — Claude Code, Cursor, Cline, OpenHands, Aider — run probe-causality experiments on your Colab session. 8 typed tools. Methodology built-in (3 mandatory baselines, 5-class verdict). We never see your model, data, or keys.
Trace Theater is the Observatory flagship. Real SAE feature IDs extracted from our multi-layer Qwen3.6-27B SAE. Token-by-token playback, heatmap, intervention slider with live counterfactuals. Try it in 2 minutes — zero login.
8 categories, including hallucination, deception, sandbagging, and eval-awareness, each with a composite ProbeScore. Cross-model Pearson_CE transfer included. Apache-2.0 reproducers. Open submissions.
Read-only public registry of probes, causality verdicts, and honest-negative findings. Each entry ships with a content-only sha256, an HF dataset, and an optional Zenodo DOI. Published via openinterp-mcp > publish().
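One way to read "content-only sha256" is a digest computed over an entry's content fields alone, serialized canonically, so that key order and metadata such as timestamps do not change the hash. The sketch below assumes that reading. The field names, the set of excluded metadata keys, and the function name are hypothetical and are not taken from the Registry's actual schema.

```python
import hashlib
import json

# Hypothetical metadata keys left out of the digest (an assumption,
# not the Registry's documented behaviour).
METADATA_KEYS = {"published_at", "doi", "hf_dataset"}


def content_sha256(entry: dict) -> str:
    """Hash only the content fields of an entry, using canonical JSON."""
    content = {k: v for k, v in entry.items() if k not in METADATA_KEYS}
    # sort_keys plus fixed separators make the serialization independent
    # of key order and whitespace.
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two illustrative entries: same content, different key order and metadata.
entry_a = {"probe": "example-probe", "verdict": "epiphenomenal",
           "published_at": "2025-01-01"}
entry_b = {"published_at": "2026-06-30", "verdict": "epiphenomenal",
           "probe": "example-probe"}

same_digest = content_sha256(entry_a) == content_sha256(entry_b)
```

Under this design, republishing an entry with a new DOI or date keeps the digest stable, while any change to the claim itself produces a new one.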
Every card below corresponds to a public artifact: a trained SAE, a validated feature pack, a protocol, or an ablation result. No vaporware.
First public TopK residual-stream SAEs on Gated DeltaNet, ensemble MoE, and triple-hybrid MoE+GDN+Gated-Attn. No one else has released these.
Stage Gate 1 correlation ρ=0.52–0.54 on held-out GSM8K / SuperGPQA. Features predict answer correctness across architectures.
Per-token SAE feature activations as dense reward inside GRPO. Qwen3.5-4B → +19 pp on GSM8K in 168 effective training steps.
Same protocol, same contrastive reward formula, runs on 4B dense-GDN, 9B ensemble-MoE, and 35B-A3B triple-hybrid. Thesis transfers.
Our G2 ablation (R1 SAE-sparse vs R2 raw-direction) shows an 11 pp gap on GSM8K. Sparse decomposition is causal, not cosmetic.
Every SAE, every reward pack, every evaluation result is public. No black boxes. Stage Gates are reproducible step by step.
First TopK residual-stream SAEs on architectures previously unreachable.
Dense reasoning-tuned · 3 layers in parallel · Residual post-L11, L31, L55 · 13× expansion · 200M training tokens
Paper-grade 3-layer SAE on Qwen3.6-27B · held-out var_expl L11 0.843 · L31 0.714 · L55 0.816 · AuxK
huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade
Hybrid Gated DeltaNet · Residual post-L18 · 16× expansion · 200M training tokens
First TopK residual-stream SAE for hybrid GDN
huggingface.co/caiovicentino1/Qwen3.5-4B-SAE-L18-topk
Ensemble MoE · Residual post-L21 · 16× expansion · 1B training tokens
First public SAE for Gemma-4 ensemble-MoE
huggingface.co/caiovicentino1/Gemma-4-E4B-SAE-L21-topk
Triple-hybrid (MoE + GDN + Gated Attention) · Residual post-L23 · 16× expansion · 92M training tokens (WIP)
First public SAE on triple-hybrid MoE+GDN+Gated-Attention. No precedent in literature.
huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-SAE-L23-topk-wip
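For readers new to the terms in the cards above, here is a minimal sketch of a TopK SAE forward pass and a variance-explained score on toy data. It follows one common TopK formulation (keep the k largest pre-activations per token, zero the rest); the released checkpoints may differ in detail. The AuxK auxiliary loss is not shown, and the shapes, the value of k, the tied initialization, and the exact variance-explained definition are all assumptions for illustration.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode activations, keep the k largest pre-activations per row, decode."""
    pre = (x - b_dec) @ W_enc + b_enc              # (batch, n_features)
    # Column indices of the k largest entries in each row.
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # ReLU on the surviving entries; everything else stays zero.
    z[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    x_hat = z @ W_dec + b_dec
    return z, x_hat


def variance_explained(x, x_hat):
    """1 - residual sum of squares / total variance around the per-dimension mean."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - resid / total


# Toy sizes; a 16x expansion means n_features = 16 * d_model.
rng = np.random.default_rng(0)
d_model, expansion, k = 8, 16, 4
n_features = d_model * expansion
x = rng.normal(size=(32, d_model))                 # stand-in for residual activations
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = W_enc.T.copy()                             # tied init, for illustration only
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
score = variance_explained(x, x_hat)
```

With random, untrained weights the score is meaningless; the point is only to show what the expansion factor, the sparsity constraint, and a held-out var_expl number refer to.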
Honest comparison against the four closest prior works. All numbers are from the published papers; we don't soften or spin.
Linear probes on activations → online RL reward
Their result: 58% hallucination reduction on Gemma-3-12B-IT
How we differ: We use sparse TopK SAE features instead of raw probes; the 11 pp R1-vs-R2 gap in our G2 is the empirical argument for why decomposition matters.
PPO with SAE features as action space (select which feature to amplify)
Their result: +1.03 pp on GSM8K with Gemma-2-2B
How we differ: We use SAE features as the reward signal itself, not as an action space. +19 pp on Qwen3.5-4B GSM8K. Methods are complementary (different axes of using SAE features in RL).
SAE features → linear head → frozen reward model for offline RLHF
Their result: Preference-model quality improvements
How we differ: We are online, per-token, and target reasoning on hybrid architectures — not preference modeling on dense transformers.
SAE feature amplification at inference (contrastive around reasoning vocabulary)
Their result: +13.4% AIME-2024 on DeepSeek-R1-Distill-Llama-8B
How we differ: Inference-time intervention vs training-time reward. We ported ReasonScore into our library for completeness and ran it on Qwen3.5-4B; there it selected rhetoric features, not correctness features.
Don't spend GPU hours on RL until you've verified the signal predicts the outcome. Every validated pack in the catalog has passed all three gates.
Verify features predict outcome on held-out data before spending GPU hours on RL.
Compare outcome-only (R0) vs outcome + SAE-sparse (R1) vs outcome + raw-direction (R2).
Scale-up with per-token mech-reward, MMLU preservation check, adversarial canary suite.
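The first gate above can be sketched as a rank-correlation check between a per-example feature score and answer correctness on held-out data. The sketch assumes ρ is a Spearman correlation and that the gate is a simple threshold; the threshold value, the function names, and the toy data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def passes_gate_1(feature_scores, correct, threshold):
    """True if the held-out correlation clears a (hypothetical) threshold."""
    return spearman(feature_scores, correct) >= threshold


# Toy held-out set: one feature score and one 0/1 correctness label per problem.
feature_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
correct = [0, 1, 0, 1, 1, 0]
gate_ok = passes_gate_1(feature_scores, correct, threshold=0.5)
```

Running this check before any RL is cheap: it needs only forward passes and labels, which is the motivation for putting it ahead of the ablation and scale-up gates.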
Per-token SAE-feature reward lifts Qwen3.5-4B from 64% to 83% on GSM8K in 168 effective training steps, 7 pp above the same-SAE trajectory-level G2 R1 ceiling (76%). MMLU did not regress. Hack rate stayed within the baseline 95% CI.
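To make "per-token" concrete, the sketch below shows one way a dense feature-based reward could be combined with an outcome reward. The page does not state the reward formula, so this is an assumption throughout: a contrast between activations of features tagged positive and features tagged negative, scaled, with the outcome reward added on the final token. The feature IDs, the scale, and the function name are hypothetical.

```python
def per_token_rewards(activations, positive_ids, negative_ids,
                      outcome_reward, scale=0.1):
    """Build one reward per generated token.

    activations: list with one dict per token, mapping feature id -> activation.
    positive_ids / negative_ids: feature ids assumed to track good / bad reasoning.
    """
    rewards = []
    for token_acts in activations:
        pos = sum(token_acts.get(f, 0.0) for f in positive_ids)
        neg = sum(token_acts.get(f, 0.0) for f in negative_ids)
        # Contrastive dense term: reward positive features, penalize negative ones.
        rewards.append(scale * (pos - neg))
    if rewards:
        # Sparse outcome term (e.g. 1.0 for a correct answer) on the last token.
        rewards[-1] += outcome_reward
    return rewards


# Toy trajectory of three tokens with hypothetical feature ids 7 and 42.
activations = [{7: 1.0}, {7: 0.5, 42: 2.0}, {}]
rewards = per_token_rewards(activations, positive_ids={7}, negative_ids={42},
                            outcome_reward=1.0)
```

A trajectory-level variant would sum the dense term into a single scalar per rollout; the per-token form instead gives the policy-gradient update a signal at every position.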
First cross-architecture validation. An SAE trained on 92M tokens (46% of the Qwen3.5-4B token budget) already matches the Qwen3.5-4B correlation level (ρ=0.540). Signal transfers to triple-hybrid MoE.
The six shipped artifacts above are the current work. Below are three structural bets we think get more valuable over time: built in public, open to being wrong.
A feature-equivalence graph across Qwen, Gemma, Llama, Claude, Mistral. "Feature 2503 in Qwen ≈ feature 8901 in Gemma" — rendered, searchable, citable. Grows with every community-submitted SAE; the dataset gets more useful the more people contribute.
A paid monitoring API (Watchtower) is designed to subsidize the OSS tier long-term — so students, researchers, and contributors never hit a paywall. Paid where it sustains (safety teams, compliance, vendor integration); free where it matters.
Working with model vendors and research labs to ship SAEs alongside model releases. When an open-source SAE lands on the same day as the model, interpretability becomes part of the release process rather than an afterthought — a better default for everyone.
Trace, edit, monitor, teach. Every SAE is public. Every Stage Gate is reproducible. Watchtower Enterprise funds the OSS tier. Join us — or watch from the sidelines.