# Benchmarks
Every number below has a reproducible notebook, a public adapter, and an eval config. No headline results without public artifacts.
## Qwen3.5-4B · Stage Gate 3 Phase A (GSM8K)
per-token mech-reward · LoRA r=32 · LR=3e-6 · λ=0.1
Per-token SAE-feature reward lifts Qwen3.5-4B from 64% → 83% on GSM8K in 168 effective training steps, 7 pp above the trajectory-level ceiling reached with the same SAE (G2 R1, 76%). MMLU did not regress, and the reward-hack rate stayed within the baseline's 95% CI.
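As a minimal sketch of how a per-token mechanistic reward of this shape can be assembled. The function name, the uniform spreading of the task reward over tokens, and the example activations are illustrative assumptions, not the project's actual implementation; only λ=0.1 comes from the config above.

```python
import numpy as np

def per_token_reward(task_reward: float,
                     sae_activations: np.ndarray,
                     lam: float = 0.1) -> np.ndarray:
    """Blend a scalar trajectory-level task reward with a per-token
    SAE-feature bonus, weighted by lambda (0.1 in the run above)."""
    num_tokens = sae_activations.shape[0]
    # Spread the scalar task reward uniformly over tokens, then add
    # the sparse per-token mechanistic signal for each position.
    return task_reward / num_tokens + lam * sae_activations

# Four tokens, two of which fire the tracked SAE feature.
rewards = per_token_reward(1.0, np.array([0.0, 0.8, 0.0, 0.3]))
print(rewards.round(3))
```

The point of the dense form is that tokens where the tracked feature fires receive extra credit even when the trajectory-level reward is identical across rollouts.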
## Qwen3.6-35B-A3B · Stage Gate 1 (SuperGPQA)
passive correlation test · n=100 held-out · disjoint from probe set
First cross-architecture validation: an SAE trained on 92M tokens (46% of the Qwen3.5-4B budget) already matches the Qwen3.5-4B correlation level (ρ=0.540), so the signal transfers to the triple-hybrid MoE architecture.
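A passive correlation test of this shape can be sketched as follows. The synthetic data, the feature construction, and the use of Spearman's ρ via `scipy.stats.spearmanr` are assumptions for illustration, not the project's actual eval config; only n=100 comes from the setup above.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in for the n=100 held-out set (disjoint from the
# probe set): binary correctness labels and a feature activation that
# partially tracks them.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=100)              # 0/1 correctness
feature = correct * 0.6 + rng.normal(0.0, 0.5, 100)  # noisy tracking

# Passive test: rank-correlate feature activation with correctness.
rho, p = spearmanr(feature, correct)
print(f"rho={rho:.3f}, p={p:.1e}")
```

Because the model is never trained against the feature during this stage, a significant ρ here is evidence the feature carries task signal rather than an artifact of optimization.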
## Prior work comparison
GSM8K lift across published SAE-based RL methods
| Method | Model | Approach | GSM8K Δ |
|---|---|---|---|
| CRL (Cho et al. 2026) | Gemma-2-2B | PPO with SAE features as action space | +1.03 pp |
| mechreward (ours, G3 Phase A) | Qwen3.5-4B | GRPO with per-token SAE-sparse reward | +19 pp |
Roughly 18× the gain reported for CRL, though the comparison is not apples-to-apples: the method differs (dense per-token reward vs. sparse action selection), the base model differs (Qwen3.5 is a stronger math base than Gemma-2), and training scale differs. The methods are complementary, not competing.
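A quick sanity check on the headline ratio, using the two deltas from the table (plain arithmetic, nothing project-specific):

```python
# GSM8K deltas from the comparison table above, in percentage points.
crl_delta = 1.03   # CRL (Cho et al. 2026), Gemma-2-2B
ours_delta = 19.0  # mechreward G3 Phase A, Qwen3.5-4B

ratio = ours_delta / crl_delta
print(f"{ratio:.1f}x")  # → 18.4x
```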