Research

Papers, posts, and roadmap.

Every artifact is open. Negative results are publishable. Broken links, stale numbers, or methodological bugs get flagged and fixed in public.

Published / in review

Per-token SAE features as online RL reward: breaking the G2 76% ceiling on GSM8K

Under moderator review

LessWrong (2026-04-17)

Full writeup of Stage Gate 1 → 2 → 3 on Qwen3.5-4B. GSM8K 64% → 83% in 168 effective training steps. MMLU preserved. Adversarial canary hack rate within 95% CI.

Read

Qwen3.6-35B-A3B SAE at L23 — Stage Gate 1 passed (ρ=0.522)

Published artifact

GitHub release · mechreward catalog

First SAE on a triple-hybrid MoE + GDN + Gated-Attention architecture. Matches the Qwen3.5-4B correlation level with 46% of the training budget.

Read

Circuit-tracer integration gap report (4 concrete issues)

Upstream issue pending

GitHub · mechreward

Integration audit of Anthropic's circuit-tracer against our hybrid-GDN SAEs. Four actionable gaps with reproducers.

Read

Roadmap

Living document. Items change as results come in.

Now
  • Stage Gate 2 for Qwen3.6-35B-A3B (3-way reward ablation: R0 / R1 / R2)
  • Stage Gate 3 Phase A for Qwen3.6-35B-A3B
  • Cross-architecture benchmark matrix (GSM8K, SuperGPQA, MATH-500)
Next
  • Auto-interp pipeline (OpenInterp-labeled features)
  • Feature-pack marketplace v1 (community submissions)
  • Paper v1 on arXiv (paper-form of LW post + S4 extensions)
  • Gemma-4-E4B G1 + G2 + G3 full pipeline
Later
  • Safety-focused feature packs (beyond reasoning)
  • Integration with Anthropic circuit-tracer via native plugin
  • Multi-step, scheduled reward shaping (intentional design roadmap)
  • Hybrid-arch SAE training library (saelib-hybrid, fork of sae_lens)

Cite this work

BibTeX for the library and protocol (arXiv paper forthcoming):

@software{openinterpretability2026mechreward,
  author = {Vicentino, Caio and contributors},
  title  = {mechreward: Mechanistic interpretability as reward signal for RL},
  year   = {2026},
  url    = {https://github.com/caiovicentino/mechreward},
  note   = {OpenInterpretability project},
}