# Benchmarks
Every number below has a reproducible notebook, a public adapter, and an eval config. No headline results without public artifacts.
## Qwen3.5-4B · Stage Gate 3 Phase A (GSM8K)
per-token mech-reward · LoRA r=32 · LR=3e-6 · λ=0.1
Per-token SAE-feature reward lifts Qwen3.5-4B from 64% → 83% on GSM8K in 168 effective training steps, 7 pp above the trajectory-level ceiling reached with the same SAE (G2 R1, 76%). MMLU did not regress, and the reward-hack rate stayed within the baseline's 95% CI.
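As a minimal sketch of how a per-token mechanistic reward of this shape can be assembled. The function name, the uniform spreading of the task reward over tokens, and the example activations are illustrative assumptions, not the project's actual implementation; only λ=0.1 comes from the config above.

```python
import numpy as np

def per_token_reward(task_reward: float,
                     sae_activations: np.ndarray,
                     lam: float = 0.1) -> np.ndarray:
    """Blend a scalar trajectory-level task reward with a per-token
    SAE-feature bonus, weighted by lambda (0.1 in the run above)."""
    num_tokens = sae_activations.shape[0]
    # Spread the scalar task reward uniformly over tokens, then add
    # the sparse per-token mechanistic signal for each position.
    return task_reward / num_tokens + lam * sae_activations

# Four tokens, two of which fire the tracked SAE feature.
rewards = per_token_reward(1.0, np.array([0.0, 0.8, 0.0, 0.3]))
print(rewards.round(3))
```

The point of the dense form is that tokens where the tracked feature fires receive extra credit even when the trajectory-level reward is identical across rollouts.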
## Qwen3.6-35B-A3B · Stage Gate 1 (SuperGPQA)
passive correlation test · n=100 held-out · disjoint from probe set
First cross-architecture validation: an SAE trained on 92M tokens (46% of the Qwen3.5-4B budget) already matches the Qwen3.5-4B correlation level (ρ=0.540), so the signal transfers to the triple-hybrid MoE architecture.
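A passive correlation test of this shape can be sketched as follows. The synthetic data, the feature construction, and the use of Spearman's ρ via `scipy.stats.spearmanr` are assumptions for illustration, not the project's actual eval config; only n=100 comes from the setup above.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in for the n=100 held-out set (disjoint from the
# probe set): binary correctness labels and a feature activation that
# partially tracks them.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=100)              # 0/1 correctness
feature = correct * 0.6 + rng.normal(0.0, 0.5, 100)  # noisy tracking

# Passive test: rank-correlate feature activation with correctness.
rho, p = spearmanr(feature, correct)
print(f"rho={rho:.3f}, p={p:.1e}")
```

Because the model is never trained against the feature during this stage, a significant ρ here is evidence the feature carries task signal rather than an artifact of optimization.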
## Prior work comparison
GSM8K lift across published SAE-based RL methods
| Method | Model | Approach | GSM8K Δ |
|---|---|---|---|
| CRL (Cho et al. 2026) | Gemma-2-2B | PPO with SAE features as action space | +1.03 pp |
| mechreward (ours, G3 Phase A) | Qwen3.5-4B | GRPO with per-token SAE-sparse reward | +19 pp |
Roughly 18× the gain reported for CRL, though the comparison is not apples-to-apples: the method differs (dense per-token reward vs. sparse action selection), the base model differs (Qwen3.5 is a stronger math base than Gemma-2), and training scale differs. The methods are complementary, not competing.
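A quick sanity check on the headline ratio, using the two deltas from the table (plain arithmetic, nothing project-specific):

```python
# GSM8K deltas from the comparison table above, in percentage points.
crl_delta = 1.03   # CRL (Cho et al. 2026), Gemma-2-2B
ours_delta = 19.0  # mechreward G3 Phase A, Qwen3.5-4B

ratio = ours_delta / crl_delta
print(f"{ratio:.1f}x")  # → 18.4x
```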