Every artifact is open. Negative results are publishable. Broken links, stale numbers, or methodological bugs get flagged and fixed in public.
Published / in review
Per-token SAE features as online RL reward: breaking the G2 76% ceiling on GSM8K
Under moderator review
LessWrong (2026-04-17)
Full writeup of Stage Gate 1 → 2 → 3 on Qwen3.5-4B. GSM8K accuracy 64% → 83% (+19 points) in 168 effective training steps. MMLU preserved. Adversarial canary hack rate within 95% CI.
Hybrid-arch SAE training library (saelib-hybrid, fork of sae_lens)
Cite this work
BibTeX for the library and protocol (arXiv paper forthcoming):
@software{openinterpretability2026mechreward,
author = {Vicentino, Caio and contributors},
title = {mechreward: Mechanistic interpretability as reward signal for RL},
year = {2026},
url = {https://github.com/caiovicentino/mechreward},
note = {OpenInterpretability project},
}
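A minimal sketch of citing the entry above from a LaTeX document. It assumes the entry has been saved to a file named `references.bib` (a hypothetical name) and that biblatex with the biber backend is used, since `@software` is a biblatex entry type and classic BibTeX styles may fall back to treating it as `@misc`:

```latex
\documentclass{article}
% biblatex understands the @software entry type
\usepackage[backend=biber]{biblatex}
% references.bib is an assumed file name holding the entry above
\addbibresource{references.bib}

\begin{document}
We use mechreward~\cite{openinterpretability2026mechreward}.

\printbibliography
\end{document}
```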