google/gemma-2-2b: reference point. Numbers from Karvonen et al. (SAEBench, 2025) and the Gemma Scope paper.
Composite SAE evaluation — one number that summarizes reconstruction, sparsity, dead-feature pressure, interpretability, and causal faithfulness. Transparent weights, editable formula, individual components always reported.
InterpScore = 0.30 × loss_recovered
+ 0.15 × (1 − dead_frac)
+ 0.15 × l0_score # exp(−|log(L0/80)|) · peaks at L0 ≈ 80
+ 0.25 × sparse_probing_auc
+ 0.15 × tpp_score
Why a composite when the SAEBench authors (Karvonen et al. 2025) explicitly refuse one? Because it makes SAEs comparable at a glance. We mitigate the "obscuring tradeoffs" objection by reporting every component alongside the score and by versioning the weights. If you disagree with the weights, open a PR with a different set, and we will mint InterpScore v0.0.2 with your formula tagged.
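The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code in notebook 18: the function and argument names are assumptions, and L0 is taken to equal the TopK `k` of the SAE, which is an assumption that happens to reproduce the leaderboard's rounded score for the first row.

```python
import math

# Weights copied from the InterpScore formula; they sum to 1.0.
WEIGHTS = {
    "loss_recovered": 0.30,
    "alive": 0.15,
    "l0_score": 0.15,
    "sparse_probing_auc": 0.25,
    "tpp_score": 0.15,
}


def l0_score(l0: float, target: float = 80.0) -> float:
    """exp(-|log(L0 / target)|): equals 1.0 at L0 == target, symmetric in log space."""
    return math.exp(-abs(math.log(l0 / target)))


def interp_score(loss_recovered: float, dead_frac: float, l0: float,
                 sparse_probing_auc: float, tpp_score: float) -> float:
    """Weighted sum of the five components (hypothetical helper, names are illustrative)."""
    components = {
        "loss_recovered": loss_recovered,
        "alive": 1.0 - dead_frac,
        "l0_score": l0_score(l0),
        "sparse_probing_auc": sparse_probing_auc,
        "tpp_score": tpp_score,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


# Row 1 of the leaderboard: 99.4% alive, k=64 used as L0 (assumption).
score = interp_score(
    loss_recovered=0.930,
    dead_frac=1 - 0.994,
    l0=64,
    sparse_probing_auc=0.860,
    tpp_score=0.720,
)
print(round(score, 3))
```

Because `l0_score` is symmetric in log space, an SAE with L0 = 40 and one with L0 = 160 receive the same sparsity credit (0.5).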
| # | Base model | SAE · layer | d_sae / k | tokens | loss_rec | alive | probing | tpp | InterpScore |
|---|---|---|---|---|---|---|---|---|---|
| 1 | google/gemma-2-2b | google/gemma-scope-2b-pt-res · L12 (res_post) | 65,536 / k=64 | 8B | 0.930 | 99.4% | 0.860 | 0.720 | 0.871 |
| 2 | google/gemma-2-9b | google/gemma-scope-9b-pt-res · L20 (res_post) | 131,072 / k=128 | 16B | 0.950 | 99.8% | 0.880 | 0.770 | 0.864 |
| 3 | Google/Gemma-4-E4B | caiovicentino1/Gemma-4-E4B-SAE-L21-topk · L21 | 32,768 / k=128 | 1B | 0.939 | 97.0% | 0.740 | 0.660 | 0.805 |
| 4 | Qwen/Qwen3.6-27B | caiovicentino1/qwen36-27b-sae-papergrade · L11 | 65,536 / k=128 | 200M | 0.998 | 99.9% | 0.862 | 0.135 | 0.779 |
| 5 | Qwen/Qwen3.5-4B | caiovicentino1/Qwen3.5-4B-SAE-L18-topk · L18 | 40,960 / k=128 | 200M | 0.866 | 99.0% | 0.710 | 0.620 | 0.773 |
| 6 | Qwen/Qwen3.6-27B | caiovicentino1/qwen36-27b-sae-papergrade · L31 | 65,536 / k=128 | 200M | 0.994 | 89.2% | 0.867 | 0.117 | 0.760 |
| 7 | Qwen/Qwen3.6-27B | caiovicentino1/qwen36-27b-sae-papergrade · L55 | 65,536 / k=128 | 200M | 0.988 | 78.0% | 0.829 | 0.242 | 0.751 |
Row 3 (Gemma-4-E4B): first public SAE for Gemma-4 ensemble-MoE.
Row 4 (Qwen3.6-27B, L11): best of the 3-layer paper-grade release. InterpScore eval: 250k tokens of C4; probes for toxicity (SetFit/toxic_conversations) and sentiment (sst2); TPP at 0.5% of d_sae (k=327).
Run notebook 18_interpscore_eval.ipynb on your SAE. It writes interpscore.json to your HF repo. Then open a PR to OpenInterpretability/web adding one line to lib/leaderboard.ts. An automated ingestion endpoint planned for Q2 2026 will accept the JSON URL directly.
All five components come from SAEBench (Karvonen et al. 2025) and Gao et al. 2024. No new metric is invented; only the weighted combination is new.
Reconstruction (0.30) is necessary-not-sufficient. Interpretability (0.25) is the differentiator. Causal faithfulness (0.15) is the honesty check. L0 + alive (0.15 + 0.15) prevent Goodharting toward trivial solutions.
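To make the weighting rationale concrete, the sketch below splits two leaderboard scores into their weighted contributions. It is an illustration only: the helper name and dictionary keys are assumptions, L0 is again taken to equal the TopK `k`, and the component values are copied from rows 1 and 4 of the table.

```python
import math

# Weights from the InterpScore formula, keyed by illustrative component names.
WEIGHTS = {
    "reconstruction": 0.30,
    "alive": 0.15,
    "l0": 0.15,
    "probing": 0.25,
    "tpp": 0.15,
}


def contributions(loss_rec, alive, l0, probing, tpp, l0_target=80.0):
    """Return each component's weighted share of the composite (hypothetical helper)."""
    values = {
        "reconstruction": loss_rec,
        "alive": alive,
        "l0": math.exp(-abs(math.log(l0 / l0_target))),
        "probing": probing,
        "tpp": tpp,
    }
    return {name: WEIGHTS[name] * values[name] for name in WEIGHTS}


# Row 1 (Gemma Scope 2B, k=64) and row 4 (Qwen3.6-27B L11, k=128).
gemma = contributions(0.930, 0.994, 64, 0.860, 0.720)
qwen = contributions(0.998, 0.999, 128, 0.862, 0.135)

for name in WEIGHTS:
    print(f"{name:15s} {gemma[name]:.3f} {qwen[name]:.3f}")
print(f"{'total':15s} {sum(gemma.values()):.3f} {sum(qwen.values()):.3f}")
```

Row 4 has the better reconstruction, yet that advantage is worth only about 0.02 in the composite, while its low TPP costs it roughly 0.09 relative to row 1. This is the sense in which reconstruction is necessary but not sufficient.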
Numbers come from user-submitted interpscore.json files on HF SAE repos. Independently verifiable — re-run notebook 18 on any SAE and compare. External entries (Gemma Scope, Anthropic) use published paper numbers with clear attribution.
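A lighter consistency check is also possible without re-running the notebook: recompute the composite from the published components and compare it with the published InterpScore. The sketch below does this for the seven table rows. It assumes L0 equals the TopK `k` and uses a tolerance of 0.001 because the published values are rounded to three decimals; the row labels are shortened for readability. It checks arithmetic consistency only and does not replace re-running the evaluation.

```python
import math

# (label, loss_rec, alive, k used as L0, probing, tpp, published InterpScore),
# copied from the leaderboard table.
ROWS = [
    ("gemma-scope-2b L12", 0.930, 0.994, 64, 0.860, 0.720, 0.871),
    ("gemma-scope-9b L20", 0.950, 0.998, 128, 0.880, 0.770, 0.864),
    ("Gemma-4-E4B L21", 0.939, 0.970, 128, 0.740, 0.660, 0.805),
    ("qwen36-27b L11", 0.998, 0.999, 128, 0.862, 0.135, 0.779),
    ("Qwen3.5-4B L18", 0.866, 0.990, 128, 0.710, 0.620, 0.773),
    ("qwen36-27b L31", 0.994, 0.892, 128, 0.867, 0.117, 0.760),
    ("qwen36-27b L55", 0.988, 0.780, 128, 0.829, 0.242, 0.751),
]

TOLERANCE = 1e-3  # published scores are rounded to three decimals


def recompute(loss_rec, alive, l0, probing, tpp):
    """Apply the InterpScore weights to one row's components."""
    l0_score = math.exp(-abs(math.log(l0 / 80)))
    return (0.30 * loss_rec + 0.15 * alive + 0.15 * l0_score
            + 0.25 * probing + 0.15 * tpp)


# Collect every row whose recomputed score disagrees with the published one.
mismatches = [
    label
    for label, *components, published in ROWS
    if abs(recompute(*components) - published) > TOLERANCE
]
print(mismatches)
```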