FabricationGuard live · ProbeBench v0.0.1 · Multi-Probe DPO

Probes that ship. Standards that survive.

The application layer for mechanistic interpretability.

We turn frontier-lab interpretability research into production-grade safety probes. Built on Anthropic Persona Vectors, DeepMind Gemma Scope, and Alibaba Qwen-Scope. Apache 2.0 · Anti-Goodhart by construction · ~1ms inference.

Try FabricationGuard Browse ProbeBench Read papers

$pip install openinterpv0.2.0 live

🎯 0.88 AUROC

FabricationGuard

cross-task hallucination · ~1ms · PyPI v0.2.0

📊 5 probes

ProbeBench

anti-Goodhart leaderboard · 7-axis ProbeScore · v0.0.1

📜 ICML MI #73

Paper-1 in review

Hallucination-Induction, Not Calibration · notification June 12

⚙️ Apache 2.0

All artifacts

5 GitHub repos · PyPI · 7 HF datasets · 1 SAE model

Built on

We extend frontier-lab interpretability infrastructure with a methodology + product layer. Apache 2.0 throughout. Anti-Goodhart by construction. See full lineage →

Anthropic Persona Vectors (Aug 2025)Anthropic Tracing Thoughts (Mar 2025)DeepMind Gemma Scope (2024)Alibaba Qwen-Scope (Apr 2026)Goodfire RLFR (2025)

What we ship

Detect. Standardize. Deploy.

Production probes. Goodhart-resistant standards. Inference-time integrations. Built on top of the SAE infrastructure that frontier labs already shipped.

Detect

Production-grade probes for hallucination, deception, and reasoning faithfulness.

FabricationGuard — 0.88 AUROC cross-task · ~1ms · PyPI v0.2.0
ReasonGuard — narrow-scope: 0.888 within-domain · honest negative cross-domain
DeceptionGuard — Apollo methodology, Q3 2026
CoTGuard v2 — causal-mediation methodology, Q3 2026

Explore detect

Standardize

ProbeBench — the anti-Goodhart leaderboard for activation probes.

ProbeBench v0.0.1 — 8 categories · 7-axis ProbeScore · 5 reference probes
Anti-Goodhart norms — random-K controls · fresh-probe AUROC · 3-way splits
InterpScore — composite metric for SAE evaluation (vs SAEBench)
Cross-substrate registry — register Qwen-Scope, Gemma Scope SAEs as upstream

Explore standardize

Deploy

vLLM/SGLang plugins, agent integrations, regulated-industry adapters.

openinterp SDK — `pip install openinterp` v0.2.0
HuggingFace Spaces — FabricationGuard ZeroGPU demo + ProbeBench leaderboard
vLLM plugin — inference-time probe scoring · Q2 2026
Medical / financial adapters — EU AI Act + FDA SaMD compliance

Explore deploy

Tools we maintain

Observe. Edit. Monitor. Teach.

Four supporting environments around the core probe + standard + deploy stack.

LIVE · Q1

Observatory

See the model thinking, feature by feature, token by token.

Trace Theater — Cinematic timeline.
Circuit Canvas — Figma-style attribution graphs.
Atlas — Google Scholar for features.
Compare — N-way diff.

Enter Observatory

Q2 2026

Laboratory

Edit the model. Compose interventions. Export steered checkpoints.

Sandbox — Drag-and-drop steering.
Recipe Store — Public marketplace of steering packs — helpful, honest, harmless.
Auto-Interp Engine — Upload a failure dataset, get back the top 50 features correlated with errors.
Counterfactual Studio — "What if this token were X?" Surgical replay with exact-token edits; see how features downstream shift.

Enter Laboratory

Q4 2026 · ENTERPRISE

Watchtower

Monitor LLMs in production. Feature-level observability for safety teams.

Feature Firehose API — Low-latency streaming API.
Safety Watchlist — Deception, sycophancy, shutdown-resistance, jailbreak-fingerprint features monitored 24/7.
Audit Trail — Immutable logs of every feature activation.

Enter Watchtower

Q3 2026

Academy

Onboard the world. From "what is an activation" to "discover a new feature" in 90 minutes.

Expeditions — Interactive tutorials with validated checkpoints and badge awards.
Interp Olympics — Monthly feature-hunting challenges with leaderboards and prizes.
Live Lectures — Researchers lecture inside the tool.
Reproducibility Vault — Every trace, circuit, recipe, and notebook hashed forever.

Enter Academy

FLAGSHIP · LIVE Q1 2026

Scrub a prompt through Qwen3.6-27B. Watch features ignite.

Trace Theater is the Observatory flagship. Real SAE feature IDs extracted from our multi-layer Qwen3.6-27B SAE. Token-by-token playback, heatmap, intervention slider with live counterfactuals. Try it in 2 minutes — zero login.

Open Trace Theater See all Observatory tools

Preview · L31 heatmap · 10 features × 23 tokens

f2503, f3383, f1847, f4521, f2156, f3892, f152 — shown

ProbeBench v0.0.1

The first leaderboard for activation probes.

8 categories — hallucination, deception, sandbagging, eval-awareness — each with a composite ProbeScore. Cross-model Pearson_CE transfer included. Apache-2.0 reproducers. Open submissions.

Open the leaderboard Submit your probe

Hallucination

0.882

cross-task SimpleQA

Deception

0.96+

Apollo re-impl

Eval-awareness

pending

UK AISI priority

Reward-hacking

pending

Anthropic 2511.18397

Six things that don't exist anywhere else.

Every card below corresponds to a public artifact: a trained SAE, a validated feature pack, a protocol, or an ablation result. No vaporware.

Hybrid-architecture SAEs

First public TopK residual-stream SAEs on Gated DeltaNet, ensemble MoE, and triple-hybrid MoE+GDN+Gated-Attn. No one else has released these.

3 models, 2048–2560 d_model, 16× expansion, 200M–1B training tokens

Validated feature packs

Stage Gate 1 correlation ρ=0.52–0.54 on held-out GSM8K / SuperGPQA. Features predict answer correctness across architectures.

ρ verified on n≥100 held-out per model

mechreward library

Per-token SAE feature activations as dense reward inside GRPO. Qwen3.5-4B → +19 pp on GSM8K in 168 effective training steps.

pip install mechreward

Cross-architecture evidence

Same protocol, same contrastive reward formula, runs on 4B dense-GDN, 9B ensemble-MoE, and 35B-A3B triple-hybrid. Thesis transfers.

Stage Gate protocol: G1 → G2 → G3

Sparse vs raw-direction ablation

Our G2 ablation (R1 SAE-sparse vs R2 raw-direction) shows an 11 pp gap on GSM8K. Sparse decomposition is causal, not cosmetic.

Direct empirical argument vs linear-probe baselines

Open catalog + protocol

Every SAE, every reward pack, every evaluation result is public. No black boxes. Stage Gates are reproducible step by step.

Apache-2.0, all artifacts on GitHub + HF Hub

Trained SAEs

First TopK residual-stream SAEs on architectures previously unreachable.

View all

Qwen/Qwen3.6-27B

Released

Dense reasoning-tuned · 3 layers in parallel (L11/L31/L55) · Residual post-L11 · L31 · L55 · 13× expansion · 200M training tokens

Paper-grade 3-layer SAE on Qwen3.6-27B · held-out var_expl L11 0.843 · L31 0.714 · L55 0.816 · AuxK

huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade

Qwen/Qwen3.5-4B

Released

Hybrid Gated DeltaNet · Residual post-L18 · 16× expansion · 200M training tokens

First TopK residual-stream SAE for hybrid GDN

huggingface.co/caiovicentino1/Qwen3.5-4B-SAE-L18-topk

Google/Gemma-4-E4B

Released

Ensemble MoE · Residual post-L21 · 16× expansion · 1B training tokens

First public SAE for Gemma-4 ensemble-MoE

huggingface.co/caiovicentino1/Gemma-4-E4B-SAE-L21-topk

Qwen/Qwen3.6-35B-A3B

Training in progress

Triple-hybrid (MoE + GDN + Gated Attention) · Residual post-L23 · 16× expansion · 92M (WIP) training tokens

First public SAE on triple-hybrid MoE+GDN+Gated-Attention. No precedent in literature.

huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-SAE-L23-topk-wip

Where we fit.

Honest comparison against the four closest prior works. All numbers are from the published papers; we don't soften or spin.

RLFR (Goodfire)

arxiv:2602.10067

Linear probes on activations → online RL reward

Their result: 58% hallucination reduction on Gemma-3-12B-IT

How we differ: We use sparse TopK SAE features instead of raw probes; the 11 pp R1-vs-R2 gap in our G2 is the empirical argument for why decomposition matters.

CRL (Holistic AI)

arxiv:2602.10437

PPO with SAE features as action space (select which feature to amplify)

Their result: +1.03 pp on GSM8K with Gemma-2-2B

How we differ: We use SAE features as the reward signal itself, not as an action space. +19 pp on Qwen3.5-4B GSM8K. Methods are complementary (different axes of using SAE features in RL).

SARM

arxiv:2508.08746

SAE features → linear head → frozen reward model for offline RLHF

Their result: Preference-model quality improvements

How we differ: We are online, per-token, and target reasoning on hybrid architectures — not preference modeling on dense transformers.

AIRI ReasonScore

arxiv:2503.18878

SAE feature amplification at inference (contrastive around reasoning vocabulary)

Their result: +13.4% AIME-2024 on DeepSeek-R1-Distill-Llama-8B

How we differ: Inference-time intervention vs training-time reward. We ported ReasonScore in our library for completeness and ran it on Qwen3.5-4B — confirmed rhetoric features, not correctness features.

The Stage Gate protocol.

Don't spend GPU hours on RL until you've verified the signal predicts the outcome. Every validated pack in the catalog has passed all three gates.

correlation pre-test

Verify features predict outcome on held-out data before spending GPU hours on RL.

Threshold: ρ ≥ 0.30
Budget: ~$5, ~30 min
Artifacts: reasoning_pack.json (10 helpful + 10 harmful feature IDs)

three-way reward ablation

Compare outcome-only (R0) vs outcome + SAE-sparse (R1) vs outcome + raw-direction (R2).

Threshold: R1 ≥ R0 + 2 pp AND R1 − R2 ≥ 5 pp
Budget: ~$15, ~100 steps
Artifacts: R0/R1/R2 LoRA adapters + comparison table

full RL, ceiling-breaking

Scale-up with per-token mech-reward, MMLU preservation check, adversarial canary suite.

Threshold: R1 ≥ 80% of target benchmark, hack_rate < 30%, MMLU regression < 2 pp
Budget: ~$60, ~400 steps
Artifacts: Published LoRA adapter + full eval table + LW writeup

Validated benchmarks.

Qwen3.5-4B · Stage Gate 3 Phase A (GSM8K)

Baseline

64%

Trained (R1)

83%

+19 pp

Per-token SAE-feature reward lifts Qwen3.5-4B from 64% → 83% on GSM8K in 168 effective training steps, +7pp above the same-SAE trajectory-level G2 R1 ceiling (76%). MMLU non-regressed. Hack rate within baseline 95% CI.

Qwen3.6-35B-A3B · Stage Gate 1 (SuperGPQA)

Spearman ρ

0.522

Pearson r

0.537

100

First cross-architecture validation. SAE trained on 92M tokens (46% of Qwen3.5-4B budget) already matches Qwen3.5-4B correlation level (ρ=0.540). Signal transfers to triple-hybrid MoE.

Three long-term bets

Where we think we can compound.

Six shipped artifacts above are the current work. Below are three structural bets we think get more valuable over time — built in public, open to being wrong.

01 · BET

Cross-model feature graph

A feature-equivalence graph across Qwen, Gemma, Llama, Claude, Mistral. "Feature 2503 in Qwen ≈ feature 8901 in Gemma" — rendered, searchable, citable. Grows with every community-submitted SAE; the dataset gets more useful the more people contribute.

Q2 2026: first rendering with 2 models. Q4: 5+ models.

02 · BET

Revenue that funds the free tier

A paid monitoring API (Watchtower) is designed to subsidize the OSS tier long-term — so students, researchers, and contributors never hit a paywall. Paid where it sustains (safety teams, compliance, vendor integration); free where it matters.

Target: first design partner Q3, first revenue Q4. Not yet proven.

03 · BET

Model partnerships

Working with model vendors and research labs to ship SAEs alongside model releases. When an open-source SAE lands on the same day as the model, interpretability becomes part of the release process rather than an afterthought — a better default for everyone.

First partnership conversation active. Nothing signed yet.

12-month roadmap

Built in public, quarter by quarter.

Full roadmap

Q2 2026· NOW

Probes shipped + paper-1 in review

→FabricationGuard v0.2.0 live · 0.88 AUROC cross-task hallucination · ~1ms
→ReasonGuard live · narrow-scope honest negative documented
→ProbeBench v0.0.1 · 5 reference probes · 7-axis ProbeScore · anti-Goodhart norms
+ 2 more

Q3 2026

More probes + integrations

→DeceptionGuard v0.1 · Apollo methodology applied to Qwen3.6
→CoTGuard v2 · causal-mediation methodology (Lanham 2023 truncation)
→BehaviorGuard · CoT-vs-action consistency for agentic systems
+ 2 more

Q4 2026

Cross-substrate + cross-model

→ProbeBench v0.1 · register Qwen-Scope, Gemma Scope SAEs as upstream substrates
→Cross-model probe transfer · FabricationGuard methodology on Llama / DeepSeek
→Probe registry API · external probes can register against any substrate
+ 2 more

Q1 2027

Regulated industries + revenue

→Medical adapter · EU AI Act Article 14 + FDA SaMD compliance probe pack
→Financial adapter · audit trail + immutable feature activation logs
→Cursor / Cline / agent integrations · BehaviorGuard for tool-use
+ 2 more

Mechanistic interpretability, operational.

Trace, edit, monitor, teach. Every SAE is public. Every Stage Gate is reproducible. Watchtower Enterprise funds the OSS tier. Join us — or watch from the sidelines.

Open Trace Theater Star on GitHub Watchtower early access Read the quickstart