LIVE · v0.0.1

InterpScore

Composite SAE evaluation — one number that summarizes reconstruction, sparsity, dead-feature pressure, interpretability, and causal faithfulness. Transparent weights, editable formula, individual components always reported.

Formula · v0.0.1

InterpScore = 0.30 × loss_recovered
            + 0.15 × (1 − dead_frac)
            + 0.15 × l0_score          # exp(−|log(L0/80)|) · peaks at L0 ≈ 80
            + 0.25 × sparse_probing_auc
            + 0.15 × tpp_score

weight | component | meaning
0.30 | loss_recovered | reconstruction fidelity
0.15 | alive | 1 − dead feature fraction
0.15 | l0_score | exp(−|log(L0/80)|) · peaks at L0 ≈ 80
0.25 | sparse_probing | AUROC on held-out labels (SAEBench)
0.15 | tpp | Targeted Probe Perturbation (causal faithfulness)
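
A minimal sketch of the v0.0.1 formula in Python, to make the arithmetic checkable. The function and argument names are illustrative, not taken from the project's notebook, and it assumes L0 can be taken as k for the listed SAEs, which matches the l0_score values shown in the score breakdowns.

```python
import math

# Weights copied from the v0.0.1 formula above.
WEIGHTS = {
    "loss_recovered": 0.30,
    "alive": 0.15,
    "l0_score": 0.15,
    "sparse_probing": 0.25,
    "tpp": 0.15,
}


def l0_score(l0: float, target: float = 80.0) -> float:
    """exp(-|log(L0 / target)|): 1.0 at L0 == target, decaying on both sides."""
    if l0 <= 0:
        raise ValueError("L0 must be positive")
    return math.exp(-abs(math.log(l0 / target)))


def interp_score(loss_recovered: float, dead_frac: float, l0: float,
                 sparse_probing_auc: float, tpp_score: float) -> float:
    """Weighted sum of the five components, all expected in [0, 1]."""
    components = {
        "loss_recovered": loss_recovered,
        "alive": 1.0 - dead_frac,
        "l0_score": l0_score(l0),
        "sparse_probing": sparse_probing_auc,
        "tpp": tpp_score,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


# Leaderboard entry 1 (gemma-2-2b, L12): alive = 0.994, so dead_frac = 0.006.
# Assumption: L0 is taken to be k = 64.
score = interp_score(0.930, 0.006, 64, 0.860, 0.720)
print(round(score, 3))  # 0.871
```

The weights sum to 1.0, so the score stays in [0, 1] whenever every component does.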

Why a composite when the SAEBench authors (Karvonen et al. 2025) explicitly refuse one? Because it makes SAEs comparable at a glance. We mitigate the "obscuring tradeoffs" objection by reporting every component alongside the score and by versioning the weights. If you disagree with the weights, open a PR with a different set, and we will mint InterpScore v0.0.2 with your formula tagged.

Leaderboard

7 entries · sorted by InterpScore desc
# | Model | SAE | d_sae / k | tokens | loss_rec | alive | probing | tpp | InterpScore
1 | google/gemma-2-2b | google/gemma-scope-2b-pt-res · L12 (res_post) | 65,536 / k=64 | 8B | 0.930 | 99.4% | 0.860 | 0.720 | 0.871
2 | google/gemma-2-9b | google/gemma-scope-9b-pt-res · L20 (res_post) | 131,072 / k=128 | 16B | 0.950 | 99.8% | 0.880 | 0.770 | 0.864
3 | Google/Gemma-4-E4B | caiovicentino1/Gemma-4-E4B-SAE-L21-topk · L21 | 32,768 / k=128 | 1B | 0.939 | 97.0% | 0.740 | 0.660 | 0.805
4 | Qwen/Qwen3.6-27B | caiovicentino1/qwen36-27b-sae-papergrade · L11 | 65,536 / k=128 | 200M | 0.998 | 99.9% | 0.862 | 0.135 | 0.779
5 | Qwen/Qwen3.5-4B | caiovicentino1/Qwen3.5-4B-SAE-L18-topk · L18 | 40,960 / k=128 | 200M | 0.866 | 99.0% | 0.710 | 0.620 | 0.773
6 | Qwen/Qwen3.6-27B | caiovicentino1/qwen36-27b-sae-papergrade · L31 | 65,536 / k=128 | 200M | 0.994 | 89.2% | 0.867 | 0.117 | 0.760
7 | Qwen/Qwen3.6-27B | caiovicentino1/qwen36-27b-sae-papergrade · L55 | 65,536 / k=128 | 200M | 0.988 | 78.0% | 0.829 | 0.242 | 0.751

Score breakdowns

1

google/gemma-2-2b

L12 (res_post) · d_sae=65,536 · k=64
InterpScore · 0.871
loss_recovered · ×0.30 · 0.930
alive · ×0.15 · 0.994
l0_score · ×0.15 · 0.800
sparse_probing · ×0.25 · 0.860
tpp · ×0.15 · 0.720

Reference point. Numbers from Karvonen et al. SAEBench 2025 + Gemma Scope paper.

2

google/gemma-2-9b

L20 (res_post) · d_sae=131,072 · k=128
InterpScore · 0.864
loss_recovered · ×0.30 · 0.950
alive · ×0.15 · 0.998
l0_score · ×0.15 · 0.625
sparse_probing · ×0.25 · 0.880
tpp · ×0.15 · 0.770

3

Google/Gemma-4-E4B

L21 · d_sae=32,768 · k=128
InterpScore · 0.805
loss_recovered · ×0.30 · 0.939
alive · ×0.15 · 0.970
l0_score · ×0.15 · 0.625
sparse_probing · ×0.25 · 0.740
tpp · ×0.15 · 0.660

First public SAE for Gemma-4 ensemble-MoE.

4

Qwen/Qwen3.6-27B

L11 · d_sae=65,536 · k=128
InterpScore · 0.779
loss_recovered · ×0.30 · 0.998
alive · ×0.15 · 0.999
l0_score · ×0.15 · 0.625
sparse_probing · ×0.25 · 0.862
tpp · ×0.15 · 0.135

Best of the three layers in the paper-grade release. InterpScore eval: 250k tokens of C4; probes for toxicity (SetFit/toxic_conversations) and sentiment (sst2); TPP at 0.5% of d_sae (k=327).

Submit your SAE

Run notebook 18_interpscore_eval.ipynb on your SAE; it writes interpscore.json to your HF repo. Then open a PR to OpenInterpretability/web adding one line to lib/leaderboard.ts. An automated ingestion endpoint, planned for Q2 2026, will accept the JSON URL directly.
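
For orientation, here is a sketch of the kind of record a submission might carry. The actual schema written by the notebook is not shown on this page, so every field name below is hypothetical; only the values are real, copied from leaderboard entry 4.

```python
import json

# HYPOTHETICAL schema: the key names are illustrative and may not match
# what 18_interpscore_eval.ipynb actually writes.
payload = {
    "interpscore_version": "0.0.1",
    "model": "Qwen/Qwen3.6-27B",
    "sae_repo": "caiovicentino1/qwen36-27b-sae-papergrade",
    "layer": 11,
    "d_sae": 65536,
    "k": 128,
    "components": {
        "loss_recovered": 0.998,
        "alive": 0.999,
        "l0_score": 0.625,
        "sparse_probing": 0.862,
        "tpp": 0.135,
    },
    "interpscore": 0.779,
}

# Serialize as it might appear in interpscore.json, then parse it back.
text = json.dumps(payload, indent=2)
restored = json.loads(text)
print(text)
```

Keeping the five components next to the composite in the same file is what lets a reviewer recompute the score without re-running the evaluation.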

Methodology

Source metrics

All 5 components come from SAEBench (Karvonen et al. 2025) and Gao et al. 2024. No new metric is invented; only the weighted combination is new.

Why these weights

Reconstruction (0.30) is necessary-not-sufficient. Interpretability (0.25) is the differentiator. Causal faithfulness (0.15) is the honesty check. L0 + alive (0.15 + 0.15) prevent Goodharting toward trivial solutions.
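
To make the L0 term concrete, the sketch below tabulates l0_score at a few sparsity levels, along with the highest composite an SAE could reach at that L0 if every other component were a perfect 1.0. The helper names and the chosen L0 values are illustrative assumptions, not part of the project.

```python
import math


def l0_score(l0: float, target: float = 80.0) -> float:
    """exp(-|log(L0/target)|), which equals min(L0/target, target/L0)."""
    return math.exp(-abs(math.log(l0 / target)))


def best_possible_score(l0: float) -> float:
    """Composite ceiling at a given L0 if all other components were 1.0."""
    other_weights = 0.30 + 0.15 + 0.25 + 0.15  # loss_recovered, alive, sparse_probing, tpp
    return other_weights + 0.15 * l0_score(l0)


for l0 in (8, 40, 64, 80, 128, 160, 800):
    print(f"L0={l0:>4}  l0_score={l0_score(l0):.3f}  ceiling={best_possible_score(l0):.3f}")
```

The penalty is symmetric in log space: halving or doubling L0 relative to 80 both give an l0_score of 0.5.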

Trust model

Numbers come from user-submitted interpscore.json files on HF SAE repos. They are independently verifiable: re-run notebook 18 on any SAE and compare. External entries (Gemma Scope, Anthropic) use published paper numbers with clear attribution.
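
A cheaper first check than re-running the notebook is to confirm that a reported composite agrees with its own reported components. The sketch below does this for the three Qwen3.6-27B rows on the leaderboard. The function names and the tolerance are assumptions: since both components and score are shown to three decimals, a mismatch of up to about 0.001 can come from rounding alone.

```python
# Weights from the v0.0.1 formula.
WEIGHTS = {
    "loss_recovered": 0.30,
    "alive": 0.15,
    "l0_score": 0.15,
    "sparse_probing": 0.25,
    "tpp": 0.15,
}


def recompute(components: dict) -> float:
    """Weighted sum of the reported components."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def is_consistent(components: dict, reported: float, tol: float = 1e-3) -> bool:
    """True if the reported score matches the components up to rounding."""
    return abs(recompute(components) - reported) <= tol


# (components, reported InterpScore) copied from leaderboard rows 4, 6 and 7.
rows = {
    "Qwen3.6-27B L11": ({"loss_recovered": 0.998, "alive": 0.999, "l0_score": 0.625,
                         "sparse_probing": 0.862, "tpp": 0.135}, 0.779),
    "Qwen3.6-27B L31": ({"loss_recovered": 0.994, "alive": 0.892, "l0_score": 0.625,
                         "sparse_probing": 0.867, "tpp": 0.117}, 0.760),
    "Qwen3.6-27B L55": ({"loss_recovered": 0.988, "alive": 0.780, "l0_score": 0.625,
                         "sparse_probing": 0.829, "tpp": 0.242}, 0.751),
}

for name, (components, reported) in rows.items():
    print(name, reported, is_consistent(components, reported))
```

This only checks internal consistency; it says nothing about whether the components themselves were measured correctly, which still requires re-running the evaluation.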