Head-to-head benchmarks

Every number below has a reproducible notebook, a public adapter, and an eval config. No headline results without public artifacts.

Qwen3.5-4B · Stage Gate 3 Phase A (GSM8K)

per-token mech-reward · LoRA r=32 · LR=3e-6 · λ=0.1

Artifact: adapter download

GSM8K baseline:  64%
R1 trained:      83%
Δ lift:          +19 pp
Effective steps: 168
MMLU Δ:          +4.5 pp
Hack rate Δ:     +4 pp (within baseline 95% CI)

Per-token SAE-feature reward lifts Qwen3.5-4B from 64% to 83% on GSM8K in 168 effective training steps, +7 pp above the same-SAE trajectory-level G2 R1 ceiling (76%). MMLU did not regress (+4.5 pp), and the hack-rate change stays within the baseline 95% CI.
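The per-token reward shape can be sketched as below. This is a minimal illustration, not the project's implementation: the function name, array shapes, feature selection, and the additive blending are all assumptions — only the λ=0.1 weight comes from the config above.

```python
import numpy as np

def per_token_mech_reward(feature_acts, target_feature_ids, task_reward, lam=0.1):
    """Blend a trajectory-level task reward with a per-token mechanistic bonus.

    feature_acts:       (T, F) SAE feature activations for T generated tokens.
    target_feature_ids: indices of reasoning-linked SAE features (hypothetical).
    task_reward:        scalar outcome reward for the whole trajectory.
    Returns a (T,) per-token reward vector.
    """
    # Mean activation of the selected sparse features at each token position.
    mech = feature_acts[:, target_feature_ids].mean(axis=1)
    # Broadcast the scalar task reward to every token, add the scaled mech signal.
    return task_reward + lam * mech
```

Returning a vector rather than a scalar is the point: GRPO-style advantages can then credit individual reasoning tokens instead of only whole trajectories.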

Qwen3.6-35B-A3B · Stage Gate 1 (SuperGPQA)

passive correlation test · n=100 held-out · disjoint from probe set

Artifact: s4_g1_summary.json

Spearman ρ: 0.522
Pearson r:  0.537
p-value:    2.62e-08
Held-out N: 100

First cross-architecture validation. An SAE trained on 92M tokens (46% of the Qwen3.5-4B budget) already reaches ρ=0.522, close to the Qwen3.5-4B correlation level (ρ=0.540). The signal transfers to a triple-hybrid MoE.
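The passive correlation test above can be reproduced in spirit with a few lines of SciPy. The function name, the ρ threshold, and the pass/fail gating are illustrative assumptions; only the reported statistics (Spearman ρ, Pearson r, p-value) mirror the summary fields.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_gate(mech_scores, correctness, rho_min=0.5):
    """Stage-gate-style passive check: does the SAE-derived score correlate
    with answer correctness on a held-out set? Thresholds are illustrative."""
    rho, p = spearmanr(mech_scores, correctness)
    r, _ = pearsonr(mech_scores, correctness)
    return {
        "spearman_rho": float(rho),
        "pearson_r": float(r),
        "p_value": float(p),
        "passed": rho >= rho_min and p < 1e-3,
    }
```

Spearman ρ is the headline number because it is rank-based: it tolerates any monotone miscalibration between the mechanistic score and correctness, which matters when the SAE score is not on a probability scale.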

Prior work comparison

GSM8K lift across published SAE-based RL methods

| Method | Model | Approach | GSM8K Δ |
| --- | --- | --- | --- |
| CRL (Cho et al. 2026) | Gemma-2-2B | PPO with SAE features as action space | +1.03 pp |
| mechreward (ours, G3 Phase A) | Qwen3.5-4B | GRPO with per-token SAE-sparse reward | +19 pp |

Roughly 18× the headline gain of CRL. Method difference (dense per-token reward vs. sparse action selection), model difference (Qwen3.5 is a stronger math base than Gemma-2), and training scale all contribute, so the methods are complementary rather than competing.
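The relative lift follows directly from the two published pp values in the table:

```python
# Ratio of headline GSM8K lifts (published pp values from the table above).
ours_pp, crl_pp = 19.0, 1.03
ratio = ours_pp / crl_pp
print(f"~{ratio:.1f}x")  # prints "~18.4x"
```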