Research-grade tools for alignment and interpretability

Mechanistic interpretability, operational.

The first open stack for training sparse autoencoders on hybrid architectures and using their features as per-token reward signals in reinforcement learning.

$ pip install mechreward
GSM8K lift (Qwen3.5-4B)
+19 pp
64% → 83% in 168 effective steps
G1 correlation (Qwen3.6-35B)
ρ = 0.52
n=100 held-out SuperGPQA, p<1e-7
Sparse vs raw ablation
+11 pp
R1 (SAE-sparse) − R2 (raw direction)
Public SAEs on hybrid arch
3
GDN · ensemble-MoE · triple-hybrid

Six things that don't exist anywhere else.

Every card below corresponds to a public artifact: a trained SAE, a validated feature pack, a protocol, or an ablation result. No vaporware.

01

Hybrid-architecture SAEs

First public TopK residual-stream SAEs on Gated DeltaNet, ensemble MoE, and triple-hybrid MoE+GDN+Gated-Attn. No one else has released these.

3 models, 2048–2560 d_model, 16× expansion, 200M–1B training tokens
02

Validated feature packs

Stage Gate 1 correlation ρ=0.52–0.54 on held-out GSM8K / SuperGPQA. Features predict answer correctness across architectures.

ρ verified on n ≥ 100 held-out examples per model
03

mechreward library

Per-token SAE feature activations as dense reward inside GRPO. Qwen3.5-4B → +19 pp on GSM8K in 168 effective training steps.

pip install mechreward
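The per-token reward above boils down to a contrastive sum over a validated feature pack (the 10 helpful + 10 harmful feature IDs shipped in each pack). A minimal sketch — function and argument names are illustrative, not mechreward's exact API:

```python
import numpy as np

def contrastive_token_reward(sae_acts, helpful_ids, harmful_ids, scale=1.0):
    """Per-token reward from SAE feature activations.

    sae_acts: (seq_len, d_sae) array of TopK SAE activations for one completion.
    Returns a (seq_len,) reward: helpful-feature mass minus harmful-feature mass.
    """
    helpful = sae_acts[:, helpful_ids].sum(axis=1)
    harmful = sae_acts[:, harmful_ids].sum(axis=1)
    return scale * (helpful - harmful)

# toy check: a helpful feature fires on every token, a harmful one on token 2
acts = np.zeros((4, 64))
acts[:, 3] = 1.0
acts[2, 7] = 2.0
r = contrastive_token_reward(acts, helpful_ids=[3], harmful_ids=[7])
```

Because the reward is dense (one scalar per token), it plugs into GRPO's advantage computation without any credit-assignment machinery.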
04

Cross-architecture evidence

Same protocol, same contrastive reward formula, run on 4B dense-GDN, 9B ensemble-MoE, and 35B-A3B triple-hybrid. The thesis transfers.

Stage Gate protocol: G1 → G2 → G3
05

Sparse vs raw-direction ablation

Our G2 ablation (R1 SAE-sparse vs R2 raw-direction) shows an 11 pp gap on GSM8K. Sparse decomposition is causal, not cosmetic.

Direct empirical argument vs linear-probe baselines
06

Open catalog + protocol

Every SAE, every reward pack, every evaluation result is public. No black boxes. Stage Gates are reproducible step by step.

Apache-2.0, all artifacts on GitHub + HF Hub

Trained SAEs

First TopK residual-stream SAEs on architectures previously unreachable.

View all

Qwen/Qwen3.5-4B

Released

Hybrid Gated DeltaNet · Residual post-L18 · 16× expansion · 200M training tokens

First TopK residual-stream SAE for hybrid GDN

var_exp
0.866
d_sae
40,960
G1 ρ
0.540
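The TopK architecture and the var_exp metric in these cards can be sketched in a few lines of NumPy. This is an illustrative forward pass under the cards' configuration (16× expansion, k active features per token), not the training code:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """TopK SAE: keep only each token's k largest pre-activations."""
    pre = x @ W_enc + b_enc                                  # (n, d_sae)
    thresh = np.partition(pre, -k, axis=1)[:, -k][:, None]   # k-th largest per row
    acts = np.where(pre >= thresh, np.maximum(pre, 0.0), 0.0)
    recon = acts @ W_dec + b_dec
    return acts, recon

def variance_explained(x, recon):
    """var_exp as reported above: 1 - residual variance / total variance."""
    return 1.0 - np.var(x - recon) / np.var(x)

# toy check: random weights, at most k = 8 active features per token
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W_enc = rng.normal(size=(16, 64)) * 0.1
b_enc = np.zeros(64)
W_dec = rng.normal(size=(64, 16)) * 0.1
b_dec = np.zeros(16)
acts, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=8)
```

TopK makes sparsity a hard architectural constraint rather than an L1 penalty, which is why d_sae-wide activations stay interpretable as a handful of named features per token.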

Google/Gemma-4-E4B

Released

Ensemble MoE · Residual post-L21 · 16× expansion · 1B training tokens

First public SAE for Gemma-4 ensemble-MoE

var_exp
0.939
d_sae
32,768
G1 ρ

Qwen/Qwen3.6-35B-A3B

Training in progress

Triple-hybrid (MoE + GDN + Gated Attention) · Residual post-L23 · 16× expansion · 92M (WIP) training tokens

First public SAE on triple-hybrid MoE+GDN+Gated-Attention. No precedent in the literature.

var_exp
0.835
d_sae
32,768
G1 ρ
0.522

Where we fit.

Honest comparison against the four closest prior works. All numbers are from the published papers; we don't soften or spin.

RLFR (Goodfire)

arxiv:2602.10067

Linear probes on activations → online RL reward

Their result: 58% hallucination reduction on Gemma-3-12B-IT

How we differ: We use sparse TopK SAE features instead of raw probes; the 11 pp R1-vs-R2 gap in our G2 is the empirical argument for why decomposition matters.

CRL (Holistic AI)

arxiv:2602.10437

PPO with SAE features as action space (select which feature to amplify)

Their result: +1.03 pp on GSM8K with Gemma-2-2B

How we differ: We use SAE features as the reward signal itself, not as an action space. +19 pp on Qwen3.5-4B GSM8K. Methods are complementary (different axes of using SAE features in RL).

SAE features → linear head → frozen reward model for offline RLHF

Their result: Preference-model quality improvements

How we differ: We are online, per-token, and target reasoning on hybrid architectures — not preference modeling on dense transformers.

AIRI ReasonScore

arxiv:2503.18878

SAE feature amplification at inference (contrastive around reasoning vocabulary)

Their result: +13.4% AIME-2024 on DeepSeek-R1-Distill-Llama-8B

How we differ: Inference-time intervention vs training-time reward. We ported ReasonScore into our library for completeness and ran it on Qwen3.5-4B, confirming that it surfaces rhetoric features, not correctness features.

The Stage Gate protocol.

Don't spend GPU hours on RL until you've verified the signal predicts the outcome. Every validated pack in the catalog has passed all three gates.

G1

correlation pre-test

Verify features predict outcome on held-out data before spending GPU hours on RL.

Threshold
ρ ≥ 0.30
Budget
~$5, ~30 min
Artifacts
reasoning_pack.json (10 helpful + 10 harmful feature IDs)
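The G1 gate is nothing more exotic than a Spearman correlation between a per-sample feature score and answer correctness, checked against the ρ ≥ 0.30 threshold. A minimal sketch — the scoring function and names are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def g1_pretest(feature_scores, correct, rho_threshold=0.30):
    """G1 correlation pre-test.

    feature_scores: one scalar per held-out sample (e.g. mean contrastive
    activation over the response). correct: 0/1 answer correctness.
    Returns (rho, p-value, passed).
    """
    rho, p = spearmanr(feature_scores, correct)
    return rho, p, bool(rho >= rho_threshold)

# synthetic held-out set: a score monotonically related to correctness
scores = np.arange(100, dtype=float)
correct = (scores >= 50).astype(int)
rho, p, passed = g1_pretest(scores, correct)
```

If the pre-test fails, you stop before G2 — that is the whole point of gating the GPU spend.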
G2

three-way reward ablation

Compare outcome-only (R0) vs outcome + SAE-sparse (R1) vs outcome + raw-direction (R2).

Threshold
R1 ≥ R0 + 2 pp AND R1 − R2 ≥ 5 pp
Budget
~$15, ~100 steps
Artifacts
R0/R1/R2 LoRA adapters + comparison table
G3

full RL, ceiling-breaking

Scale-up with per-token mech-reward, MMLU preservation check, adversarial canary suite.

Threshold
R1 ≥ 80% of target benchmark, hack_rate < 30%, MMLU regression < 2 pp
Budget
~$60, ~400 steps
Artifacts
Published LoRA adapter + full eval table + LW writeup

Validated benchmarks.

Qwen3.5-4B · Stage Gate 3 Phase A (GSM8K)

Baseline
64%
Trained (R1)
83%
Δ
+19 pp

Per-token SAE-feature reward lifts Qwen3.5-4B from 64% to 83% on GSM8K in 168 effective training steps, +7 pp above the same-SAE trajectory-level G2 R1 ceiling (76%). MMLU did not regress; hack rate stayed within the baseline 95% CI.

Qwen3.6-35B-A3B · Stage Gate 1 (SuperGPQA)

Spearman ρ
0.522
Pearson r
0.537
n
100

First cross-architecture validation. The SAE, trained on just 92M tokens (46% of the Qwen3.5-4B budget), already reaches the same correlation level as Qwen3.5-4B (ρ = 0.522 vs 0.540). The signal transfers to the triple-hybrid MoE.

Use SAE features as RL reward — in ten lines.

mechreward drops into TRL, OpenRLHF, and verl with a single import. Every feature pack is validated at ρ ≥ 0.30 on held-out data before it ships.
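mechreward's own API is not reproduced here; as an illustration of the shape such a hook takes, here is a reward callable in the convention TRL's GRPOTrainer expects for reward_funcs (completions in, list of floats out). The encoder argument is a stand-in for whatever SAE you load:

```python
import numpy as np

def make_pack_reward(sae_encode, helpful_ids, harmful_ids):
    """Build a sequence-level reward callable from a feature pack.

    sae_encode: maps a completion string to its (seq_len, d_sae) SAE
    activations -- a stand-in here, not a real loader.
    """
    def reward_fn(completions, **kwargs):
        rewards = []
        for completion in completions:
            acts = sae_encode(completion)
            rewards.append(float(acts[:, helpful_ids].sum()
                                 - acts[:, harmful_ids].sum()))
        return rewards
    return reward_fn

# toy encoder stand-in: every feature fires at 1.0 over a 3-token completion
def toy_encode(completion):
    return np.ones((3, 8))

reward_fn = make_pack_reward(toy_encode, helpful_ids=[0, 1], harmful_ids=[2])
rewards = reward_fn(["solution A", "solution B"])
```

Pass the callable to your trainer's reward hook and the validated pack does the rest.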