LIVE · v0.0.1

InterpScore

Composite SAE evaluation — one number that summarizes reconstruction, sparsity, dead-feature pressure, interpretability, and causal faithfulness. Transparent weights, editable formula, individual components always reported.

Formula · v0.0.1

InterpScore = 0.30 × loss_recovered
            + 0.15 × (1 − dead_frac)
            + 0.15 × l0_score          # exp(−|log(L0/80)|) · peaks at L0 ≈ 80
            + 0.25 × sparse_probing_auc
            + 0.15 × tpp_score

0.30 × loss_recovered: reconstruction fidelity

0.15 × alive: 1 − dead feature fraction

0.15 × l0_score: exp(−|log(L0/80)|) · peaks at L0 ≈ 80

0.25 × sparse_probing: AUROC on held-out labels (SAEBench)

0.15 × tpp: Targeted Probe Perturbation (causal faithfulness)
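
The formula can be sketched in a few lines of Python. This is a minimal illustration, not the code from the eval notebook: the function and argument names are chosen here for readability, `log` is taken to be the natural logarithm, and L0 is assumed to equal k for a TopK SAE.

```python
import math

# Weights copied from the v0.0.1 formula.
WEIGHTS = {
    "loss_recovered": 0.30,
    "alive": 0.15,
    "l0_score": 0.15,
    "sparse_probing": 0.25,
    "tpp": 0.15,
}


def l0_score(l0: float, target: float = 80.0) -> float:
    """exp(-|log(L0/target)|): equals 1.0 at the target and decays on both sides."""
    return math.exp(-abs(math.log(l0 / target)))


def interp_score(loss_recovered: float, dead_frac: float, l0: float,
                 sparse_probing_auc: float, tpp_score: float) -> float:
    """Weighted sum of the five components (illustrative names, not the notebook's)."""
    components = {
        "loss_recovered": loss_recovered,
        "alive": 1.0 - dead_frac,
        "l0_score": l0_score(l0),
        "sparse_probing": sparse_probing_auc,
        "tpp": tpp_score,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


# Component values of the gemma-2-2b entry on the leaderboard (alive = 99.4%, k = 64).
print(round(interp_score(0.930, 0.006, 64, 0.860, 0.720), 3))
```

With the gemma-2-2b values this reproduces the 0.871 shown on the leaderboard.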

Why a composite when the SAEBench authors (Karvonen et al. 2025) explicitly refuse one? Because a single number makes SAEs comparable at a glance. We mitigate the "obscuring tradeoffs" objection by reporting every component alongside the score and by versioning the weights. If you disagree with the weights, open a PR with a different set, and we will mint InterpScore v0.0.2 tagged with your formula.
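
To get a feel for how much the ordering depends on the weights, the sketch below re-ranks the leaderboard entries under an alternative weight set. The `ALT` weights are purely hypothetical (0.10 moved from reconstruction to TPP), not a proposed v0.0.2. The component values are copied from the leaderboard, with l0_score derived from k on the assumption that L0 = k.

```python
# (label, loss_recovered, alive, l0_score, sparse_probing, tpp)
ROWS = [
    ("gemma-2-2b L12",  0.930, 0.994, 0.800, 0.860, 0.720),
    ("gemma-2-9b L20",  0.950, 0.998, 0.625, 0.880, 0.770),
    ("Gemma-4-E4B L21", 0.939, 0.970, 0.625, 0.740, 0.660),
    ("Qwen3.6-27B L11", 0.998, 0.999, 0.625, 0.862, 0.135),
    ("Qwen3.5-4B L18",  0.866, 0.990, 0.625, 0.710, 0.620),
    ("Qwen3.6-27B L31", 0.994, 0.892, 0.625, 0.867, 0.117),
    ("Qwen3.6-27B L55", 0.988, 0.780, 0.625, 0.829, 0.242),
]

V001 = (0.30, 0.15, 0.15, 0.25, 0.15)  # published v0.0.1 weights
ALT = (0.20, 0.15, 0.15, 0.25, 0.25)   # hypothetical: more weight on TPP


def rank(weights):
    """Return entry labels sorted by weighted score, best first."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scored = [
        (sum(w * v for w, v in zip(weights, values)), label)
        for label, *values in ROWS
    ]
    return [label for _, label in sorted(scored, reverse=True)]


print("v0.0.1:", rank(V001))
print("alt:   ", rank(ALT))
```

Under these hypothetical weights the two Gemma Scope entries stay on top, while the low-TPP Qwen3.6-27B layers drop below Qwen3.5-4B. That is exactly the kind of tradeoff the per-component reporting is meant to keep visible.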

Leaderboard

7 entries · sorted by InterpScore desc

#	model	SAE · layer	d_sae / k	tokens	loss_rec	alive	probing	tpp	InterpScore
1	google/gemma-2-2b	google/gemma-scope-2b-pt-res · L12 (res_post)	65,536 / k=64	8B	0.930	99.4%	0.860	0.720	0.871
2	google/gemma-2-9b	google/gemma-scope-9b-pt-res · L20 (res_post)	131,072 / k=128	16B	0.950	99.8%	0.880	0.770	0.864
3	Google/Gemma-4-E4B	caiovicentino1/Gemma-4-E4B-SAE-L21-topk · L21	32,768 / k=128	1B	0.939	97.0%	0.740	0.660	0.805
4	Qwen/Qwen3.6-27B	caiovicentino1/qwen36-27b-sae-papergrade · L11	65,536 / k=128	200M	0.998	99.9%	0.862	0.135	0.779
5	Qwen/Qwen3.5-4B	caiovicentino1/Qwen3.5-4B-SAE-L18-topk · L18	40,960 / k=128	200M	0.866	99.0%	0.710	0.620	0.773
6	Qwen/Qwen3.6-27B	caiovicentino1/qwen36-27b-sae-papergrade · L31	65,536 / k=128	200M	0.994	89.2%	0.867	0.117	0.760
7	Qwen/Qwen3.6-27B	caiovicentino1/qwen36-27b-sae-papergrade · L55	65,536 / k=128	200M	0.988	78.0%	0.829	0.242	0.751

Score breakdowns

google/gemma-2-2b

L12 (res_post) · d_sae=65,536 · k=64

InterpScore

0.871

loss_recovered · ×0.30 · 0.930

alive · ×0.15 · 0.994

l0_score · ×0.15 · 0.800

sparse_probing · ×0.25 · 0.860

tpp · ×0.15 · 0.720

Reference point. Numbers from SAEBench (Karvonen et al. 2025) and the Gemma Scope paper.

google/gemma-2-9b

L20 (res_post) · d_sae=131,072 · k=128

InterpScore

0.864

loss_recovered · ×0.30 · 0.950

alive · ×0.15 · 0.998

l0_score · ×0.15 · 0.625

sparse_probing · ×0.25 · 0.880

tpp · ×0.15 · 0.770

Google/Gemma-4-E4B

L21 · d_sae=32,768 · k=128

InterpScore

0.805

loss_recovered · ×0.30 · 0.939

alive · ×0.15 · 0.970

l0_score · ×0.15 · 0.625

sparse_probing · ×0.25 · 0.740

tpp · ×0.15 · 0.660

First public SAE for Gemma-4 ensemble-MoE.

Qwen/Qwen3.6-27B

L11 · d_sae=65,536 · k=128

InterpScore

0.779

loss_recovered · ×0.30 · 0.998

alive · ×0.15 · 0.999

l0_score · ×0.15 · 0.625

sparse_probing · ×0.25 · 0.862

tpp · ×0.15 · 0.135

Best of the three layers in the paper-grade release. InterpScore eval: 250k tokens of C4; probes on toxicity (SetFit/toxic_conversations) and sentiment (sst2); TPP at 0.5% of d_sae (k=327).
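
The TPP k quoted here is separate from the SAE's TopK k=128. A minimal sketch of the arithmetic follows. The helper name is hypothetical, and rounding down is inferred from the quoted k=327 (0.5% of 65,536 is 327.68), not confirmed by the notebook.

```python
def tpp_k(d_sae: int, permille: int = 5) -> int:
    """Latent count at 0.5% of the dictionary size, rounded down (assumed)."""
    # Integer arithmetic avoids floating-point surprises near whole numbers.
    return d_sae * permille // 1000


print(tpp_k(65_536))
```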

Submit your SAE

Run notebook 18_interpscore_eval.ipynb on your SAE; it writes interpscore.json to your HF repo. Then open a PR to OpenInterpretability/web adding one line to lib/leaderboard.ts. An automated ingestion endpoint planned for Q2 2026 will accept the JSON URL directly.

Open the eval notebook · Open PR on leaderboard.ts · Propose v0.0.2 weights

Methodology

Source metrics

All 5 components come from SAEBench (Karvonen et al. 2025) and Gao et al. 2024. No new metric is invented, only the weighted combination.

Why these weights

Reconstruction (0.30) is necessary but not sufficient. Interpretability (0.25) is the differentiator. Causal faithfulness (0.15) is the honesty check. L0 and alive (0.15 + 0.15) guard against Goodharting toward trivial solutions.
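
To see how the L0 term pushes back on both very sparse and very dense SAEs, note that exp(−|log(L0/80)|) is the same as the smaller of L0 and 80 divided by the larger. The sketch below checks that equivalence numerically. The function names are illustrative, and `log` is assumed to be the natural logarithm.

```python
import math


def l0_score(l0: float, target: float = 80.0) -> float:
    """The l0_score term as written in the formula."""
    return math.exp(-abs(math.log(l0 / target)))


def l0_ratio(l0: float, target: float = 80.0) -> float:
    """Equivalent closed form: smaller of (L0, target) over the larger."""
    return min(l0, target) / max(l0, target)


for l0 in (20, 40, 64, 80, 128, 320):
    print(l0, round(l0_score(l0), 3), round(l0_ratio(l0), 3))
```

Halving or doubling L0 relative to 80 costs the same amount, so an SAE cannot buy reconstruction by activating many more features without giving up part of this term. The values at L0 = 64 and L0 = 128 match the 0.800 and 0.625 shown in the score breakdowns.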

Trust model

Numbers come from user-submitted interpscore.json files in HF SAE repos. They are independently verifiable: re-run notebook 18 on any SAE and compare. External entries (Gemma Scope, Anthropic) use published paper numbers with clear attribution.
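
The "re-run and compare" step could look roughly like the sketch below. The schema of interpscore.json is not specified on this page, so the key names are hypothetical, the tolerance is arbitrary, and the re-run values are made up for illustration.

```python
import json

# Hypothetical component keys; the real interpscore.json schema may differ.
KEYS = ("loss_recovered", "alive", "l0_score", "sparse_probing", "tpp")


def mismatches(submitted: dict, rerun: dict, tol: float = 0.01) -> list[str]:
    """Return the components whose submitted and re-run values differ by more than tol."""
    return [k for k in KEYS if abs(submitted[k] - rerun[k]) > tol]


# Submitted values: the Qwen3.6-27B L11 entry from the leaderboard.
submitted = json.loads(
    '{"loss_recovered": 0.998, "alive": 0.999, "l0_score": 0.625,'
    ' "sparse_probing": 0.862, "tpp": 0.135}'
)
# Made-up re-run values, only to exercise the comparison.
rerun = {"loss_recovered": 0.997, "alive": 0.999, "l0_score": 0.625,
         "sparse_probing": 0.858, "tpp": 0.160}

print(mismatches(submitted, rerun))
```

Comparing component by component, not just the final score, keeps a discrepancy in one metric from being averaged away by the weights.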